Introduction to STATA for Data Analysis


 Introduction to STATA for Data Analysis: A Hands-on Session for Beginners


Introduction to STATA

         STATA is a statistical software package used for data analysis, data management, statistical modelling, and data visualization.

        Meaning and Origin of STATA

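The name STATA is a blend of two words:

“Statistics” + “Data”

Sta → Statistics

Ta → Data

Thus, STATA can be read as:

“Statistics with Data”

The name is an invented word, not an acronym; the vendor, StataCorp, writes it as “Stata”.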

 

Versions of STATA

        STATA is sold in several editions; the right one depends on dataset size and research needs.

        Common Editions:

        STATA/BE (Basic Edition)

        STATA/SE (Standard Edition)

        STATA/MP (Multiprocessor Edition)

Edition    Suitable For
BE         Small datasets
SE         Medium to large datasets
MP         Very large datasets and faster processing on multicore computers

 

 

Basic Philosophy of STATA

        STATA combines menu-driven and command-based approaches.

Menu-driven Approach: Users select options from menus.

        Advantages: Beginner-friendly and easy to learn

        Limitation: Slower for repeated analysis

Command-driven Approach: Users type commands directly.

        Advantages: Faster, reproducible, professional workflow

        Limitation: Requires practice

 

 

Why Use STATA?

        Easy-to-use interface for beginners and researchers.

        Efficient handling of large datasets and fast data processing.

        Advanced statistical and econometric tools.

        High-quality tables and graphs.

        Both menu-driven and command-based operation.

        Reproducible workflow using Do-files.

        Saves time in data analysis.

 

Applications of STATA

Areas of Application:

        Survey data analysis

        Data cleaning and preparation

        Regression analysis

        Time series, cross-sectional, and panel data analysis

        Business and market research

        Public policy, health and social science research

        Hypothesis testing

        And many more

 

STATA combines:

        Statistical tools

        Data management capabilities

        Graphical analysis

        Programming functions

        within a single software environment.

Recommended Practice:

For beginners:

        Start with menus

        Gradually learn commands

For researchers:

        Prefer commands and do-files

 

Limitations of STATA

        Although STATA is powerful, it also has some limitations.

1. Paid Software

        STATA requires a license.

2. Command Learning

        Beginners may initially find commands difficult.

3. Advanced Analysis Requires Practice

        Complex econometric analysis needs deeper understanding.


Installing & opening STATA

        STATA is licensed software, so installation starts with a purchase from the official STATA website.

        After purchase, the company provides a download link along with the license details needed to install and activate it.

 

STATA Interface Overview

Main Components of STATA Interface

        Menu Bar: Access to all menus and options such as File, Edit, Data, Graphics, Statistics

        Toolbar: Shortcut icons for common tasks like opening datasets, saving files, running commands

        Command Window: Used to type commands.

        Results Window: Displays outputs and analysis results.

        Variables Window: Shows all variables in the dataset.

        Review Window: Stores previously executed commands (labeled History in recent releases).

        Properties Window: Displays variable and dataset details.

        Do-file Editor: Write, save, and run scripts.


 

Understanding Dataset Structure in STATA

STATA organizes data in rows and columns. Understanding this structure is essential before entering or analyzing data.

Rows (Observations)

        Rows represent individual units.

        Example: One student, One household, One respondent

Columns (Variables)

        Columns represent characteristics.

        Example: Age, Gender, Income, Education

 

        Example Dataset Structure

        Each cell contains a value corresponding to a variable for a particular observation.

ID    Age    Gender    Income
1     22     Male      25000
2     24     Female    30000
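As a minimal sketch, the two-row example table could be typed into STATA as follows. The lowercase variable names and the str6 width chosen for gender are illustrative assumptions, not something the session prescribes.

```stata
* Recreate the example table: each row is one observation, each column one variable
clear
input id age str6 gender income
1 22 "Male"   25000
2 24 "Female" 30000
end

describe    // lists the four variables and their storage types
list        // prints the two observations
```

Text values such as "Male" need a string type (here str6) in the input line; without it, input expects numbers.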


Data Upload in STATA

Data can be entered or loaded in two ways:

1st Method - Manual data entry

        1.1. Using the Data Editor

        1.2. Using commands (input)

Typing values into the Data Editor is the simplest method for beginners.

2nd Method - Opening or importing existing files

        2.1. Open a STATA file (.dta file)

        2.2. Import other file types

 

1.1.  Manual Data Entry using Data Editor

        Open the Data Editor (type edit in the Command window).

        A spreadsheet-like window will open.

Enter:

        Variable names in the column headers and observations in the rows.

        Alternatively, copy and paste the data from a spreadsheet.

Example:

        You can create variables such as: id, age, income …etc.

        Enter each respondent’s responses in a row, such as: 1, 25, 20000 …etc. Type amounts as plain numbers, because an entry like 20k is stored as text.

 

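        Numeric variables contain numbers.

Examples: 10, 20, 30

        String variables contain text.

Examples: "Male", "Female", "Delhi" (variables such as gender or city when they are recorded as text).

        Storage types for numeric variables:

        Byte: The smallest numeric storage type; holds whole numbers from -127 to 100.

        Int: Short for integer; holds whole numbers (no decimal points) from -32,767 to 32,740.

        Long: Holds large whole numbers, up to about ±2.14 billion.

        Float: Holds decimal numbers approximately, with about 7 digits of precision; this is the default type.

        Double: Holds decimal numbers with about 16 digits of precision.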
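The storage types can be tried out on a small made-up dataset. This is only a sketch: the variable names (small, precise, approx, city) are hypothetical. The last two lines illustrate what "approximately" means for float.

```stata
* Create one variable of each kind in a three-row dataset
clear
set obs 3
generate byte   small   = _n          // whole numbers 1, 2, 3
generate double precise = 1/3         // about 16 digits of precision
generate float  approx  = 0.1         // about 7 digits of precision
generate str10  city    = "Delhi"     // string (text) variable

describe                              // the storage type column shows byte, double, float, str10

* 0.1 stored as a float is not exactly equal to the number 0.1
count if approx == 0.1                // counts 0 observations
count if approx == float(0.1)         // counts all 3 observations
```

When comparing a float variable with a decimal constant, wrapping the constant in float() avoids this surprise.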

 

        Saving the Entered Data

        After entering data, save it for the first time with:

        save filename.dta

        To overwrite a file that already exists, add the replace option:

        save filename.dta, replace

        Explanation:

        .dta = STATA data file

        replace = overwrite if file exists

 

1.2.  Data Entry using Commands

This method is useful for small datasets.

Syntax:

        input var1 var2 var3

        values

        end

Example:

        clear

        input id age income

        1 25 20000

        2 30 30000

        3 28 25000

        end

 

        Explanation:

        clear removes existing data from memory

        input starts data entry

        Variable names are written first

        Data is entered row by row

        end finishes the data entry process

Important Rules for Data Entry

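        Variable names:

        Must start with a letter (or an underscore), contain no spaces, and be at most 32 characters long; use an underscore (_) to separate words

        Names are case-sensitive, so Income and income are different variables

        Example: income_level

        Missing values:

        Numeric missing values are represented by a dot (.) in STATA; a missing string value is an empty string ("")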

 

2.       Opening Existing Files

2.1. STATA file upload

Extension: (.dta)

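Manual upload:

        File → Open

        Open icon on the toolbar

        Recent files list in the File menu

Command-based upload:

        use "path\filename.dta"

        use "path\filename.dta", clear

        The clear option discards any data currently in memory before the file is loaded.

Direct open: Double-click the dataset file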
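Saving and reopening a dataset can be rehearsed with auto, an example dataset shipped with STATA. This is a sketch under stated assumptions: the file name auto_copy.dta is hypothetical, and the file is written to the current working directory.

```stata
* Load a built-in example dataset, save a copy, and open the copy again
sysuse auto, clear
save "auto_copy.dta", replace     // replace overwrites an earlier copy if one exists

clear                             // empty the memory
use "auto_copy.dta", clear        // load the saved .dta file
describe, short                   // number of observations and variables
```

Typing pwd shows the working directory where the file was saved; cd changes it.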

 

 

2.2. Other file upload

        Files from other programs can be imported through the menus (File → Import) or with import commands.

        Common import tasks:

        Importing Excel files

        Importing CSV files

        Importing text files

        Importing SPSS files

        Importing SAS files

        Importing dBase files and files from other statistical software

 

Steps

        File → Import → Excel Spreadsheet

Command

        import excel "C:\data.xlsx", firstrow

firstrow Option

        Uses first row as variable names.
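To practise importing without needing your own files, the sketch below first exports the built-in auto dataset to CSV and Excel and then imports both. The file names are hypothetical and are written to the current working directory. Recent releases also provide import spss and import sas for those formats.

```stata
* Create practice files from a built-in dataset
sysuse auto, clear
export delimited using "auto_demo.csv", replace
export excel using "auto_demo.xlsx", firstrow(variables) replace

* Import the CSV file; varnames(1) reads variable names from row 1
import delimited using "auto_demo.csv", varnames(1) clear
describe, short

* Import the Excel file; firstrow does the same job for spreadsheets
import excel using "auto_demo.xlsx", firstrow clear
describe, short
```

The clear option is needed whenever a dataset is already in memory; without it, the import commands stop with an error rather than discard unsaved data.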

Verifying Imported Data

        After importing data, researchers should verify:

        Variable names and missing values

        Number of observations

        Data format (storage types and display formats)

        Once the data look correct, the save command stores the dataset in .dta format.
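The checks listed here map onto a handful of commands. In this sketch the built-in auto dataset stands in for a freshly imported file, and rep78 is simply one of its variables that happens to contain missing values.

```stata
sysuse auto, clear

describe                 // variable names, storage types, display formats, labels
count                    // number of observations
misstable summarize      // variables that contain missing values
codebook rep78           // detailed look at one variable
```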


         Viewing Entered Data: Data editor and Browse

        Edit Mode: edit

        The edit command opens the Data Editor in a mode that allows data modification.

        Browse Mode: browse

        The browse command opens the same data sheet in view-only mode.

        Useful for checking datasets without the risk of changing them.

        Related commands: describe and list

        Purpose

        browse   // read-only view of the data sheet

        edit     // editable view of the data sheet

        describe // dataset information (variable names, types, labels)

        list     // shows observations in the Results window

 

STATA File Types

STATA uses different file formats.

File Type    Extension    Purpose
Data File    .dta         Stores dataset
Do-file      .do          Stores commands/scripts
Log File     .log         Stores output/results
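A log file is produced by switching logging on and off around the commands to be recorded. The sketch below assumes a hypothetical file name, session_demo.log, written to the current working directory; the text option asks for a plain-text .log file instead of STATA's default formatted log.

```stata
* Record everything shown in the Results window in a plain-text log file
capture log close                         // close any log that is already open
log using "session_demo.log", text replace

sysuse auto, clear
summarize price mpg

log close                                 // stop recording
```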

 

 

Data Management

         Data management refers to organizing and preparing data for analysis.

        Essential before statistical analysis.

        Important Data Management Tasks:

        Generating new variables

        Replacing values

        Renaming variables

        Labeling variables

        Keeping variables

        Dropping variables

        Merging datasets

        Reshaping datasets

        Sorting data

 

Generating new variables and Replace

        New variables are created with the generate command (abbreviated gen), and values of existing variables are modified with the replace command.

        Generating new variables

gen income_thousand = income/1000

        Replacing Values

replace income = 50000 if income==.

        The replace command changes existing values; here it assigns 50000 to every observation whose income is missing.
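The two commands can be tried on a three-row dataset with one missing income. The data are made up for illustration, and 50000 is simply the fill-in value used in the example, not a recommended one.

```stata
clear
input id age income
1 25 20000
2 30 .
3 28 25000
end

* Fill the missing income first, then derive the new variable from it
replace income = 50000 if income == .
generate income_thousand = income/1000

list
```

The order matters: if generate ran first, income_thousand would be missing for the second respondent, because any calculation on a missing value gives a missing result.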

 

Renaming variables

        rename inc income

        The old name comes first and the new name second, so this renames inc to income.

Rules for Naming Variables

        Must begin with a letter

        No spaces allowed

        Avoid special characters

        Use meaningful names

        Use underscore (_) if needed

Good Examples: income, age, gender, education_level

Bad Examples: 1income, income data, @income

 

Labeling variables

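        Labels improve dataset readability.

        Example: label variable income "Monthly Income"

        Value Labels

        Value labels attach descriptive text to the numeric codes of a categorical variable, for example showing 1 as Male and 2 as Female. They are created with label define and attached with label values.

        Save the dataset after labeling so that the labels are stored with the data.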
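Renaming, variable labels, and value labels can be combined in one short sketch. The data are made up, gender is assumed to be coded 1 and 2, and the label name gender_lbl is a hypothetical choice.

```stata
clear
input id inc gender
1 20000 1
2 30000 2
end

rename inc income                              // old name first, new name second
label variable income "Monthly Income"         // describes the variable itself

label define gender_lbl 1 "Male" 2 "Female"    // create the set of value labels
label values gender gender_lbl                 // attach it to the variable

describe                                       // shows the variable label
tabulate gender                                // shows Male/Female instead of 1/2
```

The underlying values of gender are still 1 and 2; the labels only change how they are displayed.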

Drop variables

        The drop command removes variables or observations that are not needed.

Command for Drop

        drop income

 

        Keeping variables

        The keep command retains the selected variables or observations and removes everything else.

        Conditions are written with the if qualifier.

        Example: keep if gender == "Female"

        keep age income

        Purpose:

        Select specific variables or observations for analysis.
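A small made-up dataset shows the difference between keeping observations, keeping variables, and dropping variables. It is a sketch only; here gender is assumed to be stored as text.

```stata
clear
input id age str6 gender income
1 22 "Male"   25000
2 24 "Female" 30000
3 31 "Female" 42000
end

keep if gender == "Female"    // keeps observations 2 and 3
keep id age income            // keeps these variables, removes gender
drop age                      // removes one more variable

list
```

Both commands change the data in memory only; the file on disk is unchanged until it is saved again.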

 

Data Merging and Reshaping

        File Merging: Why It's Necessary

        Data are often stored in multiple files, each containing different types of information (e.g., employment in one, output in another).

        Merging these files provides a comprehensive view of each unit.

        Identify the key variable for merging: an ID that identifies each unit and appears in both files.
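A one-to-one merge on a unique ID might look like the sketch below. The two tiny datasets (output and employment for three units) and the file name output_demo.dta are made up for illustration.

```stata
* First file: output per unit, saved to disk
clear
input id output
1 500
2 650
3 720
end
save "output_demo.dta", replace

* Second file: employment per unit, kept in memory
clear
input id employment
1 12
2 15
3 20
end

* Match each id in memory with the same id in the saved file
merge 1:1 id using "output_demo.dta"
tabulate _merge      // _merge equal to 3 marks units found in both files
list
```

merge 1:1 stops with an error if id is repeated in either file, which is a useful safeguard against accidental duplicates.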

 

        File Reshaping: Why It’s Necessary

        Reshaping changes the layout of the data between long format (one row per unit per time period or repeated measurement) and wide format (one row per unit, with a separate variable for each period or measurement).

        The right layout helps in data analysis and visualization.

        Steps for Reshaping Data

        Long to Wide and Wide to Long:

        Identify the ID variable for each unit and the variable that numbers the repeated measurements (Sr. No, such as year).
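The reshape command takes the unit identifier in i() and the repeated-measurement identifier in j(). The sketch below uses made-up incomes for two respondents in two years and converts the data in both directions.

```stata
* Wide format: one row per respondent, one income variable per year
clear
input id income2022 income2023
1 20000 22000
2 30000 31000
end

* Wide to long: one row per respondent per year
reshape long income, i(id) j(year)
list

* Long to wide: back to one row per respondent
reshape wide income, i(id) j(year)
list
```

The text before the year in the wide variable names (income) is the stub that reshape uses to find the variables that belong together.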

 

Sorting Data

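        The sort command arranges observations in ascending order of one or more variables.

        Example: sort income

        Useful for organizing datasets; gsort -income sorts in descending order.

Purpose

        Arrange observations in order.

        Recoding Variables

        The recode command standardizes categories by mapping old numeric codes to new ones.

        Example: recode gender (1 = 0 "Male") (2 = 1 "Female"), generate(female)

        Text such as "Male" is a value label for the new code, not a value. To label existing codes without changing them, use label define and label values.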
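Sorting and recoding can be tried together on a made-up dataset in which gender is assumed to be coded 1 and 2. The new variable name female is a hypothetical choice.

```stata
clear
input id gender income
1 1 30000
2 2 20000
3 2 25000
end

sort income          // ascending: 20000, 25000, 30000
gsort -income        // descending: 30000, 25000, 20000

* Map code 1 to 0 and code 2 to 1 in a new variable, labeling the new codes
recode gender (1 = 0 "Male") (2 = 1 "Female"), generate(female)
tabulate female
```

Using generate() leaves the original gender variable untouched, so the recoding can be checked against it.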

 

Important Basic Commands

Understanding Commands

        Commands are instructions given to STATA to perform specific tasks such as:

        describe – Displays dataset information

        summarize – Generates summary statistics

        list – Displays observations

        browse – Opens dataset in read-only mode

        clear – Removes data from memory

Structure of a Command

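        Most STATA commands follow this structure:

        command [variable_names] [if condition] [in range] [, options]

        Only the command name is always required; the parts in brackets are optional.

        Example:

        summarize income

        Here:

        summarize = command

        income = variable

        Some commands take an expression instead of a list of variables.

Example

        generate income_thousand = income/1000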
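The parts of a command can be seen by building one up step by step. This sketch uses the auto dataset shipped with STATA, so price, mpg, and foreign are that dataset's variables rather than ones from this session.

```stata
sysuse auto, clear

summarize price                            // command + variable
summarize price mpg                        // command + several variables
summarize price if foreign == 1            // restricted by an if condition
summarize price in 1/10                    // restricted to observations 1 to 10
summarize price, detail                    // with an option after the comma
```

Typing help summarize shows which options a command accepts.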

 

Do-file Editor

A Do-file is a text file containing STATA commands.

        Extension: .do

Purpose

        Saves commands permanently

        Improves reproducibility

        Useful for research projects

Advantages of Do-files

        Makes documentation easy and saves time

        Reduces typing errors

        Organizes workflow

        Makes it easy to repeat an analysis

        Useful for large projects
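A do-file is just these commands saved in order. The sketch below shows what a short one might contain; the file name analysis_demo.do and the chosen steps are hypothetical, and the built-in auto dataset stands in for your own data.

```stata
* analysis_demo.do: a hypothetical do-file
* Lines starting with * are comments; // starts a comment at the end of a line

clear all                      // start from an empty session
sysuse auto, clear             // load the data (built-in example dataset)

describe                       // step 1: inspect the data
summarize price mpg            // step 2: descriptive statistics
regress price mpg              // step 3: a simple regression
```

To run it, open the file in the Do-file Editor and execute it, or type do analysis_demo.do in the Command window.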

 

 

 

Do-file vs Command Window

        The Command window runs commands one at a time; they are not saved as a script.

        Do-files permanently store scripts.

        Do-files improve reproducibility and save time.

        The Do-file Editor is used to write, save, and run these scripts.

 

Feature            Command Window           Do-file
Saves commands     No                       Yes
Execution          One command at a time    Multiple commands together
Record             Temporary execution      Permanent record
Reproducibility    Low                      High
Best for           Quick tasks              Research projects

 

Data Cleaning: Basics

        Data cleaning is the process of identifying and correcting errors in data.

        It improves accuracy and consistency.

        It covers missing values, duplicates, and entry errors.

Importance

        Improves data quality

        Removes inconsistencies

        Ensures reliable analysis


 

Data Cleaning

        Identifying missing values

        Handling duplicates

        Checking inconsistencies

        Recoding variables

        Preparing data for analysis

Example Commands

        duplicates report

        misstable summarize

        recode

 

Common Problems:

        Missing values

        Duplicate observations

        Incorrect entries

        Outliers

        Inconsistent coding

        Typing errors

 

Detecting Duplicate Observations

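        Duplicate observations may create biased results.

        Key Message:

        Always check for duplicate records

        Command to Check for Duplicates

        duplicates report

        Command to Remove Duplicates

        duplicates drop

        With no variable list, both commands treat observations as duplicates only when they match on every variable.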
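The duplicate check can be rehearsed on a made-up dataset in which the second respondent was entered twice.

```stata
clear
input id age income
1 25 20000
2 30 30000
2 30 30000
3 28 25000
end

duplicates report      // summary: how many observations have copies
duplicates list        // shows the duplicated observations
duplicates drop        // keeps one copy of each, removes the rest
count                  // 3 observations remain
```

To check a single key instead of all variables, name it after the command, for example duplicates report id.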

 

Missing Values in STATA

        Missing values occur when information is unavailable.

        STATA represents numeric missing values with a dot (.) and treats them as larger than any number.

        Checking Data Consistency

        tabulate gender

        Identifying Missing Values

misstable summarize identifies missing data.

Example: ID, age, income = 1 25 .

This means income is missing

        Handling Missing Values

        drop if income==.

Common Solutions:

Missing data can affect research results. Common ways to handle them are:

        Remove the affected observations

        Replace missing values with the mean

        Use statistical imputation
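The sketch below finds a missing income in a made-up dataset and fills it with the mean, one of the options listed above. The variable name income_imputed is hypothetical, and mean replacement is shown only as an illustration, not as the recommended choice for every study.

```stata
clear
input id age income
1 25 20000
2 30 .
3 28 25000
end

misstable summarize                  // reports that income has one missing value
count if missing(income)             // counts 1

* Missing counts as larger than any number, so guard comparisons
count if income > 20000                        // counts 2: 25000 and the missing value
count if income > 20000 & !missing(income)     // counts 1

* Keep the original and fill the gap in a copy
generate income_imputed = income
summarize income, meanonly                     // mean of the non-missing values: 22500
replace income_imputed = r(mean) if missing(income)
list
```

Writing the filled values to a new variable keeps the original income available for comparison.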

 

Common Beginner Mistakes

        1. Misinterpreting Mean

        Mean can be affected by outliers.

        2. Ignoring Missing Values

        Missing data may distort results.

        3. Wrong Variable Type

        Categorical variables should not be analyzed using inappropriate statistics.

        Best Practices

        Always inspect data before analysis

        Use descriptive statistics before advanced analysis

        Interpret results carefully

        Check for outliers and missing values
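The first mistake, misreading the mean, is easy to see with numbers. In this made-up example one extreme income pulls the mean far above the typical value, while the median barely moves.

```stata
clear
input income
20
25
30
1000
end

summarize income, detail
display "Mean = " r(mean) "   Median = " r(p50)    // mean 268.75, median 27.5
```

Reporting the median alongside the mean, as summarize with the detail option does, makes outliers like this visible before any further analysis.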

 

Conclusion

         STATA is powerful, beginner-friendly software for data analysis in academia and industry.

        Key Benefits:

        Efficient data handling

        Statistical analysis

        Visualization

        Research reporting

        Regular practice is the key to becoming confident in STATA.

        The more you work with datasets, the more comfortable and confident you will become.

        Certificate of Participation will be provided to all attendees.

 


 

 


Thank You and Best Wishes


Raghavendra Yadav

Global Research & Training, New Delhi

Email: info@grtedu.com | Web: www.grtedu.com 

