05 Sep 2022

Spline Fitting

I reprocessed the data so that the spacing between neighboring data points is more consistent, then ran the spline through the data points.
I went back to using the univariate spline and played around with its parameters, especially the smoothing factor (s) and k, the degree of the smoothing spline.
- Other parameters (like changing the number of knots) didn't have as large an effect as the smoothing factor, for some reason
For the graph above, I only got two spline knot locations.
Found this useful page that explains splines and knots a bit more clearly:
https://stats.stackexchange.com/questions/517375/splines-relationship-of-knots-degree-and-degrees-of-freedom

When s = 0, we get a close spline fit (the spline passes through every data point), but it looks pretty choppy.
Taking the automatic smoothing factor instead gives the first graph (where the spline is not a strong fit).

Setting k = 5 immediately gave a much better fit, which is expected because a higher-degree polynomial can usually follow the data points more closely.
Printed out the coefficients of the spline function (not 100% sure how to read them), and also printed the locations of the knots.
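A minimal sketch of the comparison above, on synthetic data rather than the real measurements (the sine curve, noise level, and variable names are assumptions for illustration). It fits scipy.interpolate.UnivariateSpline with the automatic s, with s = 0, and with k = 5, then prints the knots and coefficients. One possible reason the automatic setting gave only two knots: with unweighted data the default s is tied to the number of points, which can be very permissive when the scatter in y is much smaller than 1.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-in data (NOT the notebook's measurements): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 60)              # x must be increasing
y = np.sin(x) + rng.normal(0.0, 0.1, x.size)

# s=None: SciPy picks the smoothing factor itself (the "automatic" setting).
spline_default = UnivariateSpline(x, y, k=3)

# s=0: the spline interpolates every point, so it also follows the noise.
spline_interp = UnivariateSpline(x, y, k=3, s=0)

# k is the degree of each polynomial piece; UnivariateSpline accepts 1 <= k <= 5.
spline_quintic = UnivariateSpline(x, y, k=5)

for name, spl in [("default s, k=3", spline_default),
                  ("s=0, k=3", spline_interp),
                  ("default s, k=5", spline_quintic)]:
    knots = spl.get_knots()      # knot positions, including the two end points
    coeffs = spl.get_coeffs()    # B-spline coefficients, not polynomial terms
    rms = np.sqrt(np.mean((spl(x) - y) ** 2))
    print(f"{name}: {len(knots)} knots, {len(coeffs)} coefficients, rms = {rms:.3f}")
```

The values from get_coeffs() are weights on B-spline basis functions, so they cannot be read off as the terms of a single polynomial. Two knots means there are no interior knots, i.e. the fit is one polynomial over the whole range.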

Applied the function to other sets of data; the fits were a bit too choppy.
I set the smoothing factor to 10 and then kept k at 10. I spaced out the points by keeping only points that are separated by at least 1/3 of the greatest interval width in the data set, then changed this to 1/2.
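One way to read the point-spacing step is sketched below. This is an interpretation, not the code actually used: the function name thin_by_min_gap and the rule "keep a point only if it is at least fraction times the largest gap away from the last kept point" are assumptions.

```python
import numpy as np

def thin_by_min_gap(x, y, fraction=0.5):
    """Hypothetical helper: drop points that crowd the previously kept point.

    A point is kept only if it lies at least `fraction` times the largest
    neighbor-to-neighbor gap away from the last kept point.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)                   # work on points sorted by x
    x, y = x[order], y[order]
    min_gap = fraction * np.diff(x).max()   # threshold from the widest interval
    keep = [0]                              # always keep the first point
    for i in range(1, x.size):
        if x[i] - x[keep[-1]] >= min_gap:
            keep.append(i)
    keep = np.asarray(keep)
    return x[keep], y[keep]

# Toy example: the widest gap is 1.9, so with fraction=0.5 the threshold is 0.95.
x_demo = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 3.0])
y_demo = np.array([5.0, 5.1, 5.2, 6.0, 6.1, 8.0])
x_thin, y_thin = thin_by_min_gap(x_demo, y_demo, fraction=0.5)
print(x_thin)
```

Raising the fraction from 1/3 to 1/2 keeps fewer points, which trades resolution for a smoother input to the spline.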

From these sample fits, I think the spline fits are more accurate than the original ones, although some may still need to be improved.

Previously, I had written code that plotted equivalent widths / airmasses of particular air molecules (Hα, Hβ, H2O, O2) for all dates or all stars, to see whether those molecules are independent of the star or the date of observation, as expected.
The code did not account for the different gratings through which the molecule equivalent widths were measured, so I made some adjustments to separate the files by grating and see whether the independence of molecules against stars/dates became stronger.
The code still outputs the best linear fit through the points, the r^2 of the linear fit, and the covariance coefficients:

There are definitely some significant outliers that appear to be skewing the data, which is interesting because I have already sigma-clipped the data through five iterations; maybe it needs more iterations.
Also, I think the linear fit routine I use avoids a zero slope, so I can look into a linear fit that allows a zero slope (potentially).

I tried increasing the number of iterations to get rid of noticeable outliers (like the one in the chart below), but the outlier in the top right corner remained. I also tried the setting where the data keeps getting sigma-clipped until it converges.
Below is the result when I sigma-clipped for a maximum of 100 iterations; the outlier still stayed.

Instead of 5-sigma clipping, for this particular data set I changed to 4-sigma clipping, which removed the outlier in the top right corner.
However, the linear fit for the data does not pass through the points, so something was wrong with my linear fit.
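A small NumPy-only sketch of why more iterations did not help while a lower threshold did. This is a stand-in for whatever clipping routine was actually used: the function sigma_clip_mask, its median/standard-deviation choice, and the toy data are all assumptions. Once an iteration rejects nothing new, the clipping has converged and extra iterations change nothing.

```python
import numpy as np

def sigma_clip_mask(values, sigma=5.0, max_iters=None):
    """Hypothetical helper: iterative sigma clipping about the median.

    Returns (mask, n_iters), where mask is True for rejected points and
    n_iters counts the iterations that actually rejected something.
    """
    values = np.asarray(values, dtype=float)
    mask = np.zeros(values.shape, dtype=bool)
    n_iters = 0
    while max_iters is None or n_iters < max_iters:
        kept = values[~mask]
        if kept.size == 0:
            break
        center = np.median(kept)
        spread = np.std(kept)
        new_mask = mask | (np.abs(values - center) > sigma * spread)
        if np.array_equal(new_mask, mask):
            break                           # converged: nothing new rejected
        mask = new_mask
        n_iters += 1
    return mask, n_iters

# Toy data (made up): 19 well-behaved points plus one extreme value.
values = np.append(np.linspace(-1.0, 1.0, 19), 100.0)

mask5, iters5 = sigma_clip_mask(values, sigma=5.0, max_iters=100)
mask4, iters4 = sigma_clip_mask(values, sigma=4.0, max_iters=100)
print("5 sigma:", mask5.sum(), "rejected after", iters5, "iterations")
print("4 sigma:", mask4.sum(), "rejected after", iters4, "iterations")
```

In this toy set the extreme value inflates the standard deviation so much that it sits just under 5 sigma, so the 5-sigma clip converges immediately without rejecting it, no matter how many iterations are allowed. The 4-sigma clip rejects it on the first pass.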

Found out that the linear fit I was using ignored the mask and fitted the model through the clipped data points as well (I was passing in a masked array), so I copied the elements that were not masked into a new array. This is very inefficient in running time; I can ask Eske about better ways to plot or fit through masked arrays when he gets back. For the same plot, I now get a flatter, better fit.
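A possible alternative to copying elements one by one is boolean indexing with the inverted mask, sketched below. The helper name fit_unmasked and the toy data are assumptions; np.ma.getmaskarray, np.ma.getdata, and scipy.stats.linregress are existing NumPy/SciPy calls.

```python
import numpy as np
from scipy.stats import linregress

def fit_unmasked(x, y_masked):
    """Hypothetical helper: linear fit using only the unmasked entries."""
    x = np.asarray(x, dtype=float)
    good = ~np.ma.getmaskarray(y_masked)     # True where the point survived clipping
    return linregress(x[good], np.ma.getdata(y_masked)[good])

# Toy data (made up): a line with slope 0.5 plus one clipped outlier at the end.
x = np.arange(10.0)
y = 1.0 + 0.5 * x + 0.1 * (-1.0) ** np.arange(10)
y[9] = 50.0
y_masked = np.ma.masked_array(y, mask=(np.arange(10) == 9))

clean = fit_unmasked(x, y_masked)
naive = linregress(x, np.ma.getdata(y_masked))   # mask ignored, for comparison
print(f"unmasked-only slope: {clean.slope:.3f}")
print(f"mask ignored slope:  {naive.slope:.3f}")
```

The indexing is vectorized, so it should scale better than an explicit copy loop. SciPy also has scipy.stats.mstats.linregress, which is documented as treating masked values pair-wise; it may be worth checking that it gives the same result here.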

While I was figuring out what was wrong with the linear fit, I came across the p-value that scipy.stats.linregress (which I'm using) returns; it is the p-value for the null hypothesis that the slope of the data points is 0. I took the p-values and said that if they were greater than 0.05, we cannot reject the null hypothesis, i.e. the data are consistent with a slope of 0.
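A minimal sketch of that decision rule, using the slope, rvalue, and pvalue fields returned by scipy.stats.linregress. The function name slope_verdict, the wording of the verdicts, and the toy data are assumptions.

```python
import numpy as np
from scipy.stats import linregress

ALPHA = 0.05   # significance level used in the notes

def slope_verdict(x, y, alpha=ALPHA):
    """Hypothetical helper: fit a line and interpret the slope's p-value."""
    res = linregress(x, y)
    if res.pvalue < alpha:
        verdict = "slope differs significantly from 0"
    else:
        verdict = "no significant evidence of a non-zero slope"
    return res, verdict

# Toy data (made up): a flat series and a clearly sloped one, same small wiggle.
x = np.arange(20.0)
wiggle = 0.1 * (-1.0) ** np.arange(20)
flat_res, flat_verdict = slope_verdict(x, 3.0 + wiggle)
trend_res, trend_verdict = slope_verdict(x, 3.0 + 0.5 * x + wiggle)

for label, res, verdict in [("flat", flat_res, flat_verdict),
                            ("trend", trend_res, trend_verdict)]:
    print(f"{label}: slope = {res.slope:.4f}, r^2 = {res.rvalue ** 2:.3f}, "
          f"p = {res.pvalue:.3g} -> {verdict}")
```

A p-value above 0.05 does not prove the slope is 0; it only says the data do not provide enough evidence against a zero slope.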
Printed out the p-values for the linear fits and what conclusions can be made from them:


Rerunning the independence tests, the slopes look much more horizontal, and the p-values indicate that most slopes are not significantly different from 0.

Week 8

09/09/2022

Chi-2 quality histogram

Looking at the chi2 values provided in the header of each data file, grouped by the type of grating that was used to collect the data.
If the values are much different from 1, we cannot use the file's data.
I extracted the chi2 values from the headers of all the data files and created a histogram of them.
One of the files had a chi2 value of 400+, which I removed because it was skewing the histogram. This file, "spec_data_2022062800311.fits", was previously marked in faulty_files.csv for having no O2 data, but we also see that its spectrum is messed up.
Honestly, the provided chi2 values look pretty high. Next time I can look at the spectrum graphs and the individual molecule graphs for the data files whose chi2 value is greater than a certain threshold.
Also printed out all the chi2 values.
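A sketch of the flag-then-histogram step, assuming the chi2 values have already been read from the headers into a dictionary (the header keyword is not stated in these notes, so the extraction is left out). The file names, chi2 values, threshold of 10, and function name summarize_chi2 are all made up for illustration.

```python
import numpy as np

def summarize_chi2(chi2_by_file, max_chi2=10.0, bins=10):
    """Hypothetical helper: flag files above a chi2 cut and bin the rest.

    Returns (flagged_names, counts, edges); counts and edges come from
    np.histogram applied to the values at or below max_chi2.
    """
    names = np.array(list(chi2_by_file))
    values = np.array([chi2_by_file[n] for n in names], dtype=float)
    flagged = values > max_chi2
    counts, edges = np.histogram(values[~flagged], bins=bins)
    return sorted(names[flagged].tolist()), counts, edges

# Made-up example values, NOT the real header values.
chi2_demo = {
    "file_a.fits": 0.9,
    "file_b.fits": 1.4,
    "file_c.fits": 2.1,
    "file_d.fits": 420.0,
}
flagged_files, counts, edges = summarize_chi2(chi2_demo, max_chi2=10.0, bins=5)
print("flagged for inspection:", flagged_files)
print("histogram counts:", counts.tolist())
```

Keeping the flagged names in a list, rather than only dropping them from the histogram, gives a ready-made set of files whose spectra can be inspected next.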