How to clean & filter a dataset with Python


When you reach the prediction phase of a data science or machine learning workflow, you may find that the prediction results are not as good as expected. A likely reason is that the raw data were affected by factors other than the subject you study, so the dataset should be cleaned or filtered of outliers before the prediction step.

To confirm the presence of outliers, the data can be visualised. The filtering script further below is simple but still useful in most cases. The main idea is to filter outliers using quantile metrics (the first and third quartiles and the interquartile range between them); the coefficients inside the script can be adjusted for each dataset individually.
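
As a minimal sketch of the visual check, a box plot is one option: matplotlib draws the whiskers at 1.5 times the interquartile range by default, so values beyond them appear as separate markers. The synthetic data, the column name "Target", and the helper name plot_target_box are assumptions made for illustration; replace the DataFrame with your own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_target_box(df, column="Target"):
    """Draw a box plot of one column and return the figure (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(4, 6))
    # Default whiskers reach 1.5 * IQR; points beyond them are drawn as fliers.
    ax.boxplot(df[column].dropna().to_numpy())
    ax.set_ylabel(column)
    ax.set_title("Distribution of " + column)
    return fig


# Synthetic stand-in for the raw data: 200 regular values plus 3 extreme ones.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 200), [120.0, 130.0, -40.0]])
demo = pd.DataFrame({"Target": values})

fig = plot_target_box(demo)
# plt.show()  # uncomment in an interactive session to display the figure
```

If the plot shows isolated markers far from the box, the column is a candidate for filtering.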

This outlier filtering method is known as trimming and has one possible disadvantage: depending on the original raw data, it can filter out non-outliers together with the outliers.
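
The disadvantage is easiest to see on skewed data. The sketch below applies the quartile rule to synthetic, exponentially distributed values that contain no injected outliers at all; the data, the seed, and the helper name iqr_mask are illustrative assumptions. A few percent of perfectly regular values in the long right tail still fall outside the upper limit and would be removed.

```python
import numpy as np
import pandas as pd


def iqr_mask(series, k=1.5):
    """Return True for values strictly inside the quartile fences (hypothetical helper)."""
    q_low = series.quantile(0.25)
    q_hi = series.quantile(0.75)
    iqr = q_hi - q_low
    return (series > q_low - k * iqr) & (series < q_hi + k * iqr)


# Skewed but legitimate data: every value comes from the same distribution.
rng = np.random.default_rng(42)
target = pd.Series(rng.exponential(scale=1.0, size=10_000), name="Target")

keep = iqr_mask(target)
removed_fraction = 1.0 - keep.mean()
print(f"Removed {removed_fraction:.1%} of the values, none of which was injected as an outlier")
```

For data like this, a larger coefficient or a transformation of the column before filtering may be worth considering.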

More advanced dataset cleaning and filtering techniques can be found in the statistics section; links are available below.




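import pandas as pd

# Load the raw data; the column to clean is named "Target".
dataset = pd.read_excel('rawdata.xlsx')

# First and third quartiles of the target column.
q_low = dataset["Target"].quantile(0.25)
q_hi = dataset["Target"].quantile(0.75)

# Interquartile range (IQR).
iqr = q_hi - q_low

# Limits of the accepted range; 1.5 is the coefficient to adjust per dataset.
lower = q_low - 1.5 * iqr
upper = q_hi + 1.5 * iqr

# Keep only the rows whose target value lies strictly inside the limits.
df_filtered = dataset[(dataset["Target"] > lower) & (dataset["Target"] < upper)]

# Save the cleaned dataset.
df_filtered.to_excel('df_filtered.xlsx')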
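
Because the coefficient usually needs tuning for each dataset, it can help to wrap the filter in a function. This is a sketch, not part of the original script: the function name filter_outliers and its parameters column, k, lower_q, and upper_q are hypothetical, and the small in-memory DataFrame stands in for rawdata.xlsx.

```python
import pandas as pd


def filter_outliers(df, column="Target", k=1.5, lower_q=0.25, upper_q=0.75):
    """Keep rows whose `column` value lies strictly inside the quantile limits."""
    q_low = df[column].quantile(lower_q)
    q_hi = df[column].quantile(upper_q)
    spread = q_hi - q_low
    inside = (df[column] > q_low - k * spread) & (df[column] < q_hi + k * spread)
    return df[inside]


# In-memory stand-in for the raw data: 18 is borderline, 1000 is clearly extreme.
raw = pd.DataFrame({"Target": [10, 12, 11, 13, 12, 11, 10, 14, 13, 18, 1000]})

default = filter_outliers(raw)       # same rule as the script above
wide = filter_outliers(raw, k=3.0)   # looser limits keep the borderline value

print(len(raw), len(default), len(wide))
```

With the default coefficient both 18 and 1000 are dropped, while k=3.0 keeps 18 and still removes 1000, which illustrates how the choice of coefficient decides what counts as an outlier.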


See also related topics: