Introduction to STATA for Data Analysis

Introduction to STATA for Data Analysis: A Hands-on Session for Beginners

Introduction to STATA

• STATA is a statistical software package used for data analysis, data management, statistical modelling, and data visualization.

• Meaning and Origin of STATA

The name STATA comes from:

“Statistics” + “Data”

Sta → Statistics

Ta → Data

Thus, STATA means:

“Statistics with Data”

Versions of STATA

• Different versions of STATA are available depending on research needs.

• Common Versions:

• STATA/BE (Basic Edition)

• STATA/SE (Standard Edition)

• STATA/MP (Multiprocessor Edition)

Version	Suitable For
BE	Small datasets
SE	Medium to large datasets
MP	Very large datasets and advanced processing

Basic Philosophy of STATA

• Combines menu-driven and command-based approaches.

• Menu-driven Approach: Users select options from menus.

• Advantages: Beginner-friendly and easy to learn

• Limitation: Slower for repeated analysis

• Command-driven Approach: Users type commands directly.

• Advantages: Faster, reproducible, professional workflow

• Limitation: Requires practice

Why Use STATA?

• Easy-to-use interface for beginners and researchers.

• Efficient handling of large datasets.

• Provides advanced statistical and econometric tools.

• Creates high-quality tables and graphs.

• Supports both command-based and menu-driven analysis.

• Fast data processing.

• Reproducible workflow using do-files.

• Saves time in data analysis.

Applications of STATA

Areas of Application:

• Survey data analysis

• Data cleaning and preparation

• Regression analysis

• Time series, cross sectional and panel data analysis

• Business and market research

• Public policy, health and social science research

• Hypothesis testing

• And many more

STATA combines the following within a single software environment:

• Statistical tools

• Data management capabilities

• Graphical analysis

• Programming functions

Recommended Practice:

For beginners:

• Start with menus

• Gradually learn commands

For researchers:

• Prefer commands and do-files

Limitations of STATA

• Although STATA is powerful, it also has some limitations.

1. Paid Software

• STATA requires a license.

2. Command Learning

• Beginners may initially find commands difficult.

3. Advanced Analysis Requires Practice

• Complex econometric analysis needs deeper understanding.

Installing & opening STATA

• STATA is purchased from the official website (stata.com).

• After purchase, the company (StataCorp) provides a download link along with the license details needed to install it.

STATA Interface Overview

Main Components of STATA Interface

• Menu Bar: Access to all menus and options such as File, Edit, Data, Graphics, Statistics

• Toolbar: Shortcut icons for common tasks like opening datasets, saving files, running commands

• Command Window: Used to type commands.

• Results Window: Displays outputs and analysis results.

• Variables Window: Shows all variables in the dataset.

• Review Window: Stores previously executed commands.

• Properties Window: Displays variable and dataset details.

• Do-file Editor: Write, save, and run scripts.

Understanding Dataset Structure in STATA

STATA organizes data in:

• Rows

• Columns

Understanding structure (rows and columns) is essential

Rows (Observations)

• Rows represent individual units.

• Example: One student, One household, One respondent

Columns (Variables)

• Columns represent characteristics.

• Example: Age, Gender, Income, Education

• Example Dataset Structure

• Each cell contains a value corresponding to a variable for a particular observation.

ID	Age	Gender	Income
1	22	Male	25000
2	24	Female	30000

Data Upload in STATA

Data can be entered or loaded in two ways:

• Method 1: Manual data entry

• 1.1. Using the Data Editor (the simplest method for beginners)

• 1.2. Using commands (the input command)

• Method 2: Opening or importing existing files

• 2.1. Opening a STATA file (.dta file)

• 2.2. Importing other file types

1.1. Manual Data Entry using Data Editor

• Open data editor

• A spreadsheet-like window will open

Enter:

• Variable names in the column headers and observations in the rows.

• Values can be typed in or pasted from a spreadsheet (copy-paste method).

Example:

• You can create variables such as: id, age, income, etc.

• Enter each respondent's values in a row, such as: 1, 25, 20000. Type plain numbers; an entry like 20k is not a valid number.

• Numeric variables contain numbers.

Examples: 10, 20, 30

• String variables contain text.

Examples: "Male", "Delhi"

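• Storage types for numeric variables:

• Byte: the smallest numeric storage type; whole numbers from -127 to 100.

• Int: Int stands for integer; whole numbers (without decimal points) from -32,767 to 32,740.

• Long: large whole numbers, up to about ±2.1 billion.

• Float: decimal numbers stored approximately, with about 7 digits of accuracy. This is the default type for new numeric variables.

• Double: decimal numbers stored with very high precision, about 16 digits of accuracy.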

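• Saving the Entered Data

• After entering data, save it for the first time with:

• save filename.dta

• To overwrite an existing file of the same name, add the replace option:

• save filename.dta, replace

• Explanation:

• .dta = STATA data file

• replace = overwrite the file if it already exists (without it, save stops with an error)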

1.2. Data Entry using Commands

This method is useful for small datasets.

Syntax:

• input var1 var2 var3

• values

• end

Example:

• clear

• input id age income

• 1 25 20000

• 2 30 30000

• 3 28 25000

• end

• Explanation:

• clear removes existing data from memory

• input starts data entry

• Variable names are written first

• Data is entered row by row

• end finishes the data entry process

Important Rules for Data Entry

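• Variable names:

• Must start with a letter and contain no spaces (use underscore _ instead)

• May contain only letters, digits, and underscores, up to 32 characters; names are case-sensitive (Income and income are different variables)

• Example: income_level

• Missing values:

• Represented by a dot (.) for numeric variables and by an empty string ("") for string variables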

2. Opening Existing Files

2.1. STATA file upload

Extension: (.dta)

• Manual upload:

• File → Open

• Open icon on the toolbar

• Recent files list in the File menu

• Command-based upload:

• use "path\filename.dta"

• use "path\filename.dta", clear (clear discards the data currently in memory before loading)

• Direct open: double-click the .dta file
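
The command-based route can be tried without any file of your own. The sketch below assumes the example dataset auto, which ships with STATA, and a hypothetical file name my_auto.dta; it saves a copy and then reopens it.

```stata
* Load the example dataset that ships with STATA
sysuse auto, clear

* Show the current working directory, where the file will be saved
pwd

* Save a copy (my_auto.dta is a hypothetical file name)
save "my_auto.dta", replace

* Reopen the saved file; clear discards whatever is in memory first
use "my_auto.dta", clear
describe
```

Without the clear option, use refuses to load a file when the data in memory have unsaved changes.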

2.2. Other file upload

• Other file types can be imported manually (File → Import) or with commands.

• Files that can be imported include:

• Excel files

• CSV files

• Text files

• SPSS files

• SAS files

• dBase files and files from other statistical software

Steps

• File → Import → Excel Spreadsheet

Command

• import excel "C:\data.xlsx", firstrow

firstrow Option

• Uses first row as variable names.
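
As a rough illustration of command-based import, the sketch below first writes two small files from the example dataset auto and then reads them back. The file names auto_sample.xlsx and auto_sample.csv are hypothetical; in practice you would point the import commands at your own files.

```stata
* Create small files to import (file names are hypothetical)
sysuse auto, clear
export excel using "auto_sample.xlsx", firstrow(variables) replace
export delimited using "auto_sample.csv", replace

* Excel: firstrow takes variable names from the first row
import excel "auto_sample.xlsx", firstrow clear
describe

* CSV: varnames(1) takes variable names from the first row
import delimited "auto_sample.csv", varnames(1) clear
describe
```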

Verifying Imported Data

• After importing data, researchers should verify:

• Variable names

• Missing values

• Number of observations

• Data format

• Once the data look correct, use the save command to store the dataset in .dta format.
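
One possible verification routine is sketched below, using the example dataset auto that ships with STATA in place of a freshly imported file.

```stata
* Stand-in for a freshly imported dataset
sysuse auto, clear

* Variable names, storage types, and labels
describe

* Number of observations
count

* One-line overview of every variable
codebook, compact

* Variables that contain missing values
misstable summarize

* First five observations
list in 1/5
```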

Viewing Entered Data: Data Editor and Browse

• Edit Mode: edit

• The edit command opens the Data Editor and allows data modification.

• Browse Mode: browse

• The browse command opens the data in view-only mode.

• Useful for checking datasets without the risk of changing them.

• Related commands:

• browse → opens the data sheet (read-only view)

• edit → opens the data sheet (editable view)

• describe → shows dataset information

• list → displays observations in the Results window

STATA File Types

STATA uses different file formats.

File Type	Extension	Purpose
Data File	.dta	Stores dataset
Do-file	.do	Stores commands/scripts
Log File	.log	Stores output/results

Data Management

• Data management refers to organizing and preparing data for analysis.

• Essential before statistical analysis.

• Important Data Management Tasks:

• Generating new variables

• Replacing values

• Renaming variables

• Labeling variables

• Keeping variables

• Dropping variables

• Merging datasets

• Reshaping datasets

• Sorting data

Generating new variables and Replace

• New variables are created with the generate command (abbreviated gen), and existing values are modified with the replace command.

• Generating a new variable:

gen income_thousand = income/1000

• Replacing values:

replace income = 50000 if income==.

• The replace command changes existing values; here it sets income to 50000 wherever income is missing.
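
A minimal sketch of generate and replace on a small made-up dataset follows. The variable names income_thousand and high_income and the 25000 cut-off are illustrative assumptions.

```stata
clear
input id age income
1 25 20000
2 30 .
3 28 25000
end

* New variable: income in thousands (stays missing where income is missing)
generate income_thousand = income/1000

* Indicator: 1 if income is at least 25000, 0 otherwise.
* STATA treats missing as larger than any number, so the
* if condition keeps the indicator missing when income is missing.
generate high_income = (income >= 25000) if !missing(income)

* Modify existing values: round to one decimal place
replace income_thousand = round(income_thousand, 0.1)

list
```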

Renaming variables

• rename inc income (old name first, then the new name)

Rules for Naming Variables

• Must begin with a letter

• No spaces allowed; use underscore (_) if needed

• Use meaningful names

• Avoid special characters

Good Examples: income, age, gender, education_level

Bad Examples: 1income, income data, @income

Labeling variables

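• Labels improve dataset readability.

• Variable labels describe a variable. Example: label variable income "Monthly Income"

• Value labels attach text to the numeric codes of a categorical variable (for example, 1 = Male, 2 = Female).

• Save the dataset after labeling so that the labels are stored with it.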
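
The sketch below shows both kinds of label on a small made-up dataset. The label name gender_lbl and the coding 1 = Male, 2 = Female are assumptions chosen for illustration.

```stata
clear
input id gender income
1 1 25000
2 2 30000
3 2 18000
end

* Variable labels describe what each variable measures
label variable income "Monthly Income"
label variable gender "Gender of respondent"

* Value labels: define the set once, then attach it to the variable
label define gender_lbl 1 "Male" 2 "Female"
label values gender gender_lbl

describe
tabulate gender
```

The underlying values of gender remain 1 and 2; only the display changes.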

Drop variables

• drop command removes unnecessary variables or observations.

Command for Drop

• drop income

• Keeping variables

• keep command retains selected variables or observations.

• Conditions can be added with the if qualifier.

• Example: keep if gender == "Female"

• keep age income

• Purpose:

• Select specific variables or observations for analysis.
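
A short sketch of drop and keep on made-up data follows. Both commands change the data in memory, so it is worth saving the original file before trying them.

```stata
clear
input id age str6 gender income
1 22 "Male" 25000
2 24 "Female" 30000
3 31 "Female" 18000
end

* Remove a variable
drop income

* Keep only the listed variables (anything else is dropped)
keep id age gender

* Keep only observations that meet a condition
keep if gender == "Female"

* Drop observations that meet a condition
drop if age > 30

list
```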

Data Merging and Reshaping

• File Merging: Why It's Necessary

• Data is often stored in multiple files, each containing different types of information (e.g., employment in one, output in another).

• Merging these files provides a comprehensive view of each unit.

• Identify the key variable for merging: a unique ID that appears in both files.
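
A minimal sketch of a one-to-one merge is given below. The file name employment.dta, the variable names, and the numbers are all made up for illustration; the first file is written to the current working directory.

```stata
* File 1: employment by firm, saved to the working directory
clear
input firm_id employment
1 120
2 45
3 300
end
save "employment.dta", replace

* File 2: output by firm (now the data in memory)
clear
input firm_id output_value
1 5000
2 1800
4 9100
end

* Match on the unique ID; each firm_id appears once in each file
merge 1:1 firm_id using "employment.dta"

* _merge records whether a row came from memory only, the file only, or both
tabulate _merge
list
```

Here firms 1 and 2 appear in both files, while firms 3 and 4 appear in only one, so checking _merge before analysis shows which units are incomplete.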

• File Reshaping: Why It's Necessary

• Reshaping changes the data layout between long format (one row per unit per time period or category) and wide format (one row per unit, with a separate column for each period or category).

• Helps in data analysis and visualization.

• Steps for Reshaping Data

• Long to Wide and Wide to Long:

• Identify the ID variable and the serial-number variable (Sr. No., for example year or round) that distinguishes repeated values within each ID.
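
The two directions can be sketched with a tiny made-up dataset holding income for two years. The variable names id, year, and income are illustrative.

```stata
* Wide format: one row per person, one income column per year
clear
input id income2022 income2023
1 20000 22000
2 30000 31000
end

* Wide to long: i() is the ID, j() is the new variable holding the year
reshape long income, i(id) j(year)
list

* Long to wide: back to one row per person
reshape wide income, i(id) j(year)
list
```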

Sorting Data

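• sort command arranges observations in ascending order of the named variable.

• Example: sort income

• Useful for organizing datasets.

Purpose

• Arrange observations in order.

• Recoding Variables

• recode command changes the numeric codes of a variable to standardize categories.

• Example: recode gender (1 = 0) (2 = 1)

• recode works on numeric variables only; text such as "Male" and "Female" is attached afterwards with value labels.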
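
A small sketch of sorting and recoding follows. It assumes gender is stored as a numeric code (1 = Male, 2 = Female); the new variable name female and the label name female_lbl are hypothetical.

```stata
clear
input id gender income
1 1 25000
2 2 30000
3 2 18000
end

* Ascending order
sort income
list

* Descending order uses gsort with a minus sign
gsort -income
list

* Recode into a new 0/1 variable, leaving the original untouched
recode gender (1 = 0) (2 = 1), generate(female)

* Attach readable text to the new codes
label define female_lbl 0 "Male" 1 "Female"
label values female female_lbl
tabulate female
```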

Important Basic Commands

Understanding Commands

• Commands are instructions given to STATA to perform specific tasks such as:

• describe – Displays dataset information

• summarize – Generates summary statistics

• list – Displays observations

• browse – Opens dataset in read-only mode

• clear – Removes data from memory

Structure of a Command

• Most STATA commands follow this structure:

• command variable_name

• Example:

• summarize income

• Here:

• summarize = command

• income = variable

Example

• generate income_thousand = income/1000
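
Beyond the basic pattern of a command followed by a variable name, many commands also accept a condition, a range of observations, options after a comma, and a by prefix. The sketch below illustrates these on the example dataset auto that ships with STATA.

```stata
sysuse auto, clear

* command + variable
summarize price

* options come after a comma
summarize price mpg, detail

* if restricts the command to observations meeting a condition
summarize price if foreign == 1

* in restricts the command to a range of observations
list make price in 1/5

* the bysort prefix repeats the command for each group
bysort foreign: summarize price
```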

Do-file Editor

A Do-file is a text file containing STATA commands.

• Extension: .do

Purpose

• Saves commands permanently

• Improves reproducibility

• Useful for research projects

Advantages of Do-files

• Easy documentation and saves time

• Reduces typing errors

• Organizes workflow

• Repeat analysis easily

• Useful for large projects

Do-file vs Command Window

• Command Window executes commands one at a time; they are not saved as a script.

• Do-files permanently store scripts.

• Do-files improve reproducibility and save time.

• Do-file Editor: Used to write, save, and run scripts.

Feature	Command Window	Do-file
Saves commands	No (temporary execution)	Yes (permanent record)
Execution	One command at a time	Multiple commands together
Reproducibility	Low	High
Best for	Quick tasks	Research projects
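
A do-file might look like the sketch below. The file names my_first_analysis.do and my_first_analysis.log are hypothetical, and the example dataset auto stands in for your own data. The // and /* */ comment styles are meant for do-files and may not work when typed directly in the Command window; a line starting with * works in both.

```stata
* my_first_analysis.do (hypothetical file name)
* Purpose: describe and summarize the example dataset

clear all
capture log close          // close any log left open, ignore errors if none
log using "my_first_analysis.log", text replace

sysuse auto, clear         // example dataset shipped with STATA
describe
summarize price mpg

/* Block comments can span
   several lines */
generate price_thousand = price/1000
summarize price_thousand

log close
```

Running the whole file from the Do-file Editor repeats every step in order and records the output in the log file.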

Data Cleaning: Basics

• Data cleaning is the process of identifying and correcting errors in data, such as missing values, duplicates, and incorrect entries.

• It improves accuracy and consistency.

Importance

• Improves data quality

• Improves accuracy

• Removes inconsistencies

• Ensures reliable analysis

Data Cleaning

• Identifying missing values

• Handling duplicates

• Checking inconsistencies

• Recoding variables

• Preparing data for analysis

Example Commands

• duplicates report

• misstable summarize

• recode

Common Problems:

• Missing values

• Duplicate observations

• Incorrect entries

• Outliers

• Inconsistent coding

• Typing errors

Detecting Duplicate Observations

• Duplicate observations may create biased results.

• Key Message:

• Always check for duplicate records

• Command to Check Duplicates

• duplicates report

• Remove Duplicates

• duplicates drop (removes observations that are exact copies on every variable, keeping one of each)
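
The duplicate check can be sketched on a small made-up dataset in which respondent 2 was entered twice. The variable name dup is hypothetical.

```stata
clear
input id age income
1 25 20000
2 30 30000
2 30 30000
3 28 25000
end

* How many observations are exact copies?
duplicates report

* Show the duplicated observations
duplicates list

* Flag them: dup is 0 for unique rows, 1 or more for copies
duplicates tag, generate(dup)
list

* Remove the extra copies, keeping one of each
duplicates drop
count

* Confirm that id now identifies each observation uniquely
isid id
```

isid prints nothing when the check passes and stops with an error when the ID is repeated.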

Missing Value in STATA

• Missing values occur when information is unavailable.

• STATA represents missing values with a dot (.)

• Checking Data Consistency

• tabulate gender

• Identifying Missing Values

misstable summarize identifies missing data.

Example: ID, age, income = 1 25 .

This means income is missing for respondent 1.

• Handling Missing Values

• drop if income==.

Common Solutions:

Missing data can affect research results. Common ways to handle it:

• Remove observations

• Replace missing values with mean

• Use statistical imputation
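
Two of these options are sketched below on made-up data. The variable name income_filled is hypothetical, and mean replacement is shown only as the simplest illustration, not as a recommendation for every dataset.

```stata
clear
input id age income
1 25 20000
2 30 .
3 28 25000
4 41 31000
end

* Where are the missing values, and how many?
misstable summarize
count if missing(income)

* Option A: replace missing values with the mean, in a new variable
* so that the original income is left untouched
generate income_filled = income
summarize income, meanonly
replace income_filled = r(mean) if missing(income_filled)
list

* Option B: remove observations with missing income
drop if missing(income)
count
```

The function missing() also catches the extended missing codes (.a to .z), which a comparison with a single dot does not.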

Common Beginner Mistakes

• 1. Misinterpreting Mean

• Mean can be affected by outliers.

• 2. Ignoring Missing Values

• Missing data may distort results.

• 3. Wrong Variable Type

• Categorical variables (for example, gender coded 1 and 2) should be summarized with frequencies, not with a mean.

• Best Practices

• Always inspect data before analysis

• Use descriptive statistics before advanced analysis

• Interpret results carefully

• Check for outliers and missing values
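
A first inspection along these lines might look like the sketch below, again using the example dataset auto that ships with STATA.

```stata
sysuse auto, clear

* Continuous variable: detail shows the median (50th percentile)
* and extreme values alongside the mean, which helps spot outliers
summarize price, detail

* Categorical variable: frequencies rather than a mean
tabulate foreign

* The missing option adds a row for missing values
tabulate rep78, missing
```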

Conclusion

• STATA is powerful, beginner-friendly software for data analysis in academia and industry.

• Key Benefits:

• Efficient data handling

• Statistical analysis

• Visualization

• Research reporting

• Regular practice is the key to becoming confident in STATA.

• The more you work with datasets, the more comfortable and confident you will become.

• Certificate of Participation will be provided to all attendees.

Thank You and Best Wishes

Raghavendra Yadav

Global Research & Training, New Delhi

Email: info@grtedu.com | Web: www.grtedu.com
