03 Data Cleaning and EDA
Data Cleaning ⇔ EDA
- infinite loop of data science
Data Cleaning
- transforming raw data to facilitate subsequent analysis
- often addresses issues:
  - structure / formatting (`re`)
  - missing or corrupted values (`na`, `null`)
  - unit conversion
  - encoding text as numbers …etc
Exploratory Data Analysis (EDA)
Getting to *know* the data
- a key part of Data Science
- transforming, visualizing, summarizing data in order to:
  - build / confirm understanding of the data and its provenance
  - identify / address potential issues in the data (\(\rightarrow\) domain knowledge required)
  - inform the subsequent analysis
  - discover potential hypotheses (be careful)
- open-ended analysis (be willing to find something surprising)
Key Data Properties of EDA
Structure
the shape of the data file
RECTANGULAR DATA
preferable
- easy to manipulate & analyze
- big part of data cleaning: transforming data to be more rectangular
Types of rectangular data (see the pandas sketch below):
- Tables (`dataframes`, `relations`)
  - named columns with different types
  - manipulated using data transformation languages (`map`, `filter`, `group by`, `join`, …)
- Matrices
  - numeric data of the same type
  - manipulated using linear algebra
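A minimal pandas sketch of the transformation verbs above; the `grades` table and its columns are made up for illustration:

```python
import pandas as pd

# hypothetical table of student grades
grades = pd.DataFrame({
    "student": ["kim", "lee", "park", "choi"],
    "major":   ["CS", "CS", "Stats", "Stats"],
    "score":   [91, 78, 85, 88],
})

# map: apply a function element-wise to a column
grades["letter"] = grades["score"].map(lambda s: "A" if s >= 90 else "B")

# filter: keep only the rows matching a predicate
cs_only = grades[grades["major"] == "CS"]

# group by: aggregate within groups
mean_by_major = grades.groupby("major")["score"].mean()
```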
DATA FILES
- Tabular data
  - TSV: Tab-separated values
  - CSV: Comma-separated values
    - common!
    - issues: how to deal with commas, tabs in records & quoting
- JSON: JavaScript Object Notation
  - widely used file format for nested data
  - similar to python dictionaries
  - issues (see the un-nesting sketch below):
    - not rectangular
    - each record can have different fields
    - records can contain tables \(\rightarrow\) complicated
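A common cleaning step is flattening such nested records into a rectangular table. A minimal pandas sketch, using hypothetical records shaped like the plant catalog below:

```python
import pandas as pd

# hypothetical nested JSON records
plants = [
    {"common": "Bloodroot", "price": 2.44,
     "description": {"color": "white", "petals": True}},
    {"common": "Marigold", "price": 1.25,
     "description": {"color": "orange", "petals": True}},
]

# json_normalize flattens nested fields into columns like "description.color"
df = pd.json_normalize(plants)
```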
- XML: Extensible Markup Language
  - nested data

example

```xml
<catalog>
  <plant type="a">
    <common>Bloodroot</common>
    <botanical>Sanguinaria canadensis</botanical>
    <zone>4</zone>
    <light>Mostly Shady</light>
    <price>2.44</price>
    <availability>03/15/2006</availability>
    <description>
      <color>white</color>
      <petals>true</petals>
    </description>
    <indoor>true</indoor>
  </plant>
  …
</catalog>
```
- Log Data

example

```
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
169.237.6.168 - - [8/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
```
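Log lines like these usually have to be parsed into records by hand, e.g. with `re`. A hedged sketch that extracts only a few fields (the pattern is illustrative, not a full Apache log parser):

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04 HTTP/1.1" 301 328')

# host, timestamp, request method/path/protocol, status code, response size
pattern = r'(\S+) - - \[(.*?)\] "(\w+) (\S+) (\S+)" (\d+) (\d+)'
match = re.match(pattern, line)
if match:
    host, ts, method, path, proto, status, size = match.groups()
```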
KEYS AND JOINS
- useful when referencing other pieces of data
- Primary Key: column / set of columns that determines the values of the remaining columns
  - unique (usually `ID`)
- Foreign Key: column (set of columns) that references primary keys in other tables
  - a `join` is required to combine the tables (see the sketch below)
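A minimal pandas sketch of joining on a primary key / foreign key pair; both tables are hypothetical:

```python
import pandas as pd

students = pd.DataFrame({            # "id" is the primary key
    "id": [1, 2, 3],
    "name": ["kim", "lee", "park"],
})
enrollments = pd.DataFrame({         # "student_id" is a foreign key into students.id
    "student_id": [1, 1, 2],
    "course": ["COSE471", "COSE474", "COSE471"],
})

joined = enrollments.merge(students, left_on="student_id", right_on="id")
```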
Questions to ask about STRUCTURE:
- a checklist before using the data!
- Are the data in a standard format or encoding?
  - Tabular Data: CSV, TSV, Excel, SQL
  - Nested Data: JSON, XML
- Are the data organized in 'records'?
  - if no: can we define records by parsing the data?
- Are the data nested?
  - if yes: can we reasonably un-nest the data?
- Does the data reference other data?
  - if yes: can we join/merge the data?
- What are the fields in each record?
  - How are they encoded? (ex: strings, numbers, binary, dates, …)
  - What is the *type* of the data?
Variable Types
- Quantitative: ratios and intervals have meaning
  - Continuous: could be measured to arbitrary precision (ex: price, temperature)
  - Discrete: finite possible values (ex: # of siblings, yrs of education)
- Qualitative
  - Ordinal: categories w/ levels but no consistent meaning to differences (ex: preferences, level of education) (ordered)
  - Nominal: categories w/ no specific ordering (ex: political affiliation, ID #) (non-ordered)
| | (Quant) Continuous | (Quant) Discrete | (Qual) Ordinal | (Qual) Nominal |
|---|---|---|---|---|
| \(CO_2\) level (PPM) | O | | | |
| # of Siblings | | O | | |
| GPA | O | | | |
| Income bracket (low, mid, high) | | | O | |
| Race | | | | O |
| # of years of education | | O | | |
| Yelp Rating (?) | | | O | |
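In pandas, the qualitative types above map naturally onto categoricals; a minimal sketch (the data are made up):

```python
import pandas as pd

# ordinal: levels have a meaningful order
income = pd.Categorical(["low", "high", "mid", "low"],
                        categories=["low", "mid", "high"], ordered=True)

# nominal: no ordering among categories
party = pd.Categorical(["A", "B", "A"], ordered=False)

below_high = income < "high"   # comparisons only make sense for ordered categoricals
```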
Granularity
how fine/coarse is each datum
- What does each record represent?
- ex: a purchase, a person, a group of users
- do all records capture granularity at the same level?
- some data will include summaries (aka rollups) as records
- if the data are coarse, how were they aggregated?
- sampling, averaging…etc
Scope
how (in)complete is the data
- (big enough) Does my data cover my area of interest?
- ex: studying Crimes in Korea but only having Seoul crime data
- (small enough) Is my data too big?
  - ex: interested in student grades for COSE471 but have grades for all CS classes
- Filtering \(\Rightarrow\) implications on the sample?
  - if `data == sample`: may have poor coverage after filtering
- Does my data cover the right time frame? (related to temporality)
- Sampling Frame: the population from which the data was sampled
  - sampling frame `!=` population of interest
Revisiting Sampling Frame
- How complete/incomplete is the frame (and its data)?
- How is the frame/data situated in place?
- How well does the frame/data capture reality?
- How is the frame/data situated in time?
Temporality
how is the data situated in time
- Data changes: when was the data collected?
- What's the meaning of time and date fields? (1, 2, or 3?)
  1. When the event happened
  2. When the data was collected or entered into the system
  3. When the data was copied into a database (look for many matching timestamps)
- What does the time depend on? (Time zones & daylight saving)
- Learn to use the `datetime` Python library (see the sketch below)
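A minimal `datetime` sketch with explicit time zones (assumes Python 3.9+ for `zoneinfo`):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# the timestamp from the log example above, with its -0800 offset made explicit
t = datetime(2014, 1, 26, 10, 47, 58, tzinfo=ZoneInfo("America/Los_Angeles"))
t_utc = t.astimezone(timezone.utc)   # convert to a common timezone (UTC)
```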
Faithfulness
how well does the data capture “reality” \(\mid\) Do I trust the data?
Checklists
- does my data contain unrealistic or "incorrect" values?
  - dates in the future for events in the past
  - non-existent locations
  - negative counts
  - misspellings of names
  - large outliers
- does my data violate obvious dependencies?
  - ex: age doesn't match birthday
- was the data entered by hand?
  - spelling errors, shifted fields
  - did forms require fields or provide default values?
- are there obvious signs of data falsification?
  - repeated names, fake-looking email addresses, repeated use of uncommon names/fields
- Missing Values / Default Values: `" "`, `0`, `-1`, `999`, `12345`, `NaN`, `Null`, `1970`, `1900`
Solutions on Missing Values
- Drop records with missing values
  - ≈ most common
  - Caution: check for biases introduced by dropped values
- Imputation (inferring missing values)
  - Mean imputation: replace with an average value (often use the closest related subgroup mean)
  - Hot deck imputation: replace with a random value
    - choose a random value from the subgroup and use it for the missing value
- Suggestions (see the sketch below)
  - drop missing values but check for induced bias (use domain knowledge)
  - directly model missing values during future analysis (ex: a `MISSING_VALUE_TOKEN`: let the model learn it in the future)
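A hedged sketch of the two imputation strategies on a made-up column, imputing within the closest related subgroup:

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "value": [1.0, None, 3.0, None]})

# mean imputation: fill each missing value with its subgroup mean
mean_imputed = df["value"].fillna(
    df.groupby("group")["value"].transform("mean"))

# hot deck imputation: fill with a random observed value from the same subgroup
def hot_deck(s):
    observed = s.dropna()
    return s.apply(lambda v: observed.sample(1).iloc[0] if pd.isna(v) else v)

hot_decked = df.groupby("group")["value"].transform(hot_deck)
```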
Solutions on other Faithfulness Issues
- Truncated data (early Excel limits: 65,536 rows, 256 columns)
  - \(\Rightarrow\) be aware of the consequences in analysis (how did truncation affect the sample?)
- Time Zone Inconsistencies
  - \(\Rightarrow\) convert to a common timezone (ex: `UTC`)
  - \(\Rightarrow\) or convert to the timezone of the location - useful in modeling behavior
- Duplicated Records or Fields
  - \(\Rightarrow\) identify + eliminate (use primary keys, as in the sketch below) \(\Rightarrow\) implications on the sample?
- Spelling Errors
  - \(\Rightarrow\) apply corrections \(\mid\) or drop records not in a dictionary
- Units not specified or inconsistent
  - \(\Rightarrow\) infer units, check that values are in reasonable ranges for the data
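A minimal sketch of eliminating duplicated records via a primary key (the table is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3],            # "id" is the primary key
                   "value": [10, 20, 20, 30]})

deduped = df.drop_duplicates(subset=["id"])       # keeps the first occurrence
```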
Summary
- examine data and metadata
- what is the date, size, organization, and structure of the data?
- examine each field / attribute / dimension individually
- examine pairs of related dimensions
- stratifying earlier analyses (ex: break down grades by major)
- along the way:
- visualize/summarize the data
- validate assumptions about data and collection process
- identify and address anomalies
- apply data transformation and corrections
- Record everything you do!