Python Flashcards

(10 cards)

1
Q

How would you load and explore a large dataset in Python?

A

Use the pandas library with pd.read_csv() or pd.read_parquet() to load the data. Inspect structure with .info(), check for missing values with .isnull().sum(), and summarize distributions with .describe(). Use df.head() and df.sample() for quick overviews.

pd.read_csv() is suitable for CSV files, while pd.read_parquet() is optimized for larger files in Parquet format.
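A minimal sketch of that workflow; the file name transactions.csv is a placeholder:

```python
import pandas as pd

# Load the data ("transactions.csv" is a hypothetical file).
df = pd.read_csv("transactions.csv")

# Structure: column names, dtypes, non-null counts, memory usage.
df.info()

# Missing values per column.
print(df.isnull().sum())

# Summary statistics for numeric columns.
print(df.describe())

# Quick looks: first rows and a random sample.
print(df.head())
print(df.sample(5))
```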

2
Q

How would you check data quality or integrity in Python?

A

Check for missing values, duplicates, and outliers using .isnull(), .duplicated(), and visualizations like histograms or boxplots. Validate data types and constraints using assertions or filters.

Ensuring non-negative values in columns like balances is crucial for data integrity.
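A short sketch of those checks; the input file and the balance column are illustrative:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("accounts.csv")  # hypothetical input file

# Missing values and duplicate rows.
print(df.isnull().sum())
print(df.duplicated().sum())

# Constraint check: balances should be non-negative.
assert (df["balance"] >= 0).all(), "Negative balances found"

# Visual outlier screen with a boxplot.
df["balance"].plot(kind="box")
plt.show()
```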

3
Q

What’s your process for writing clean, reusable Python code for testing or modeling?

A

Structure code into modular functions with clear inputs and outputs, use docstrings, follow naming conventions, and wrap repeatable processes into reusable functions or classes. Version-control with Git and document assumptions and outputs.

This is particularly important in regulated environments like GRA/EIT.
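One way that might look, a hypothetical helper with a docstring, explicit inputs and outputs, and no side effects (the name and quantile defaults are illustrative):

```python
def winsorize(series, lower=0.01, upper=0.99):
    """Clip a numeric pandas Series to the given quantile bounds.

    Returns a new Series; the input is left unmodified.
    """
    lo, hi = series.quantile([lower, upper])
    return series.clip(lo, hi)
```

A small function like this can be unit-tested with pytest and reused across scripts instead of copy-pasting the clipping logic.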

4
Q

How do you merge and clean data from multiple sources in Python?

A

Use pandas.merge() on key fields while handling duplicates and mismatches. Standardize column formats, strip whitespace, and use .fillna() or .dropna() based on business logic. Validate merged data by checking row counts before and after.

Testing joins on subsets first can help identify issues early.
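A sketch of that pattern, with hypothetical source files keyed on customer_id:

```python
import pandas as pd

# Hypothetical sources sharing a "customer_id" key.
left = pd.read_csv("customers.csv")
right = pd.read_csv("balances.csv")

# Standardize the key field before joining.
for df in (left, right):
    df["customer_id"] = df["customer_id"].astype(str).str.strip()

# Drop duplicate keys, then merge and validate the join cardinality.
right = right.drop_duplicates(subset="customer_id")
merged = left.merge(right, on="customer_id", how="left", validate="many_to_one")

# A left join onto unique keys should not change the row count.
assert len(merged) == len(left)

# Fill gaps per business logic, e.g. a missing balance becomes 0.
merged["balance"] = merged["balance"].fillna(0)
```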

5
Q

What libraries do you use for statistical analysis and model testing in Python?

A

Use scipy for hypothesis testing, statsmodels for regression and time-series models, scikit-learn for machine learning, pandas and numpy for data manipulation, and matplotlib/seaborn for visualization. For test automation, use pytest or the built-in unittest.

Each library serves a specific purpose in the data analysis workflow.
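As a quick illustration of one of these in action, a two-sample t-test with scipy on synthetic data:

```python
import numpy as np
from scipy import stats

# Synthetic samples with slightly different means.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=200)
b = rng.normal(0.2, 1.0, size=200)

# Test whether the means differ significantly.
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```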

6
Q

Explain how you would implement a test to monitor fraud detection performance over time.

A

Collect model predictions and actual outcomes, calculate metrics like precision, recall, and AUC, and track them in a dashboard. Automate with scheduled jobs using pandas, scikit-learn, and matplotlib.

Setting alerts for performance drops can help maintain model efficacy.
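A sketch of the tracking step; the column names, the 0.5 score threshold, and the 0.7 recall floor are all illustrative choices:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical log of scored transactions with known outcomes.
df = pd.read_csv("scored_transactions.csv", parse_dates=["date"])
df["predicted"] = (df["score"] >= 0.5).astype(int)  # example threshold

# Track metrics by month (assumes each month has both classes).
monthly = df.groupby(df["date"].dt.to_period("M")).apply(
    lambda g: pd.Series({
        "precision": precision_score(g["actual"], g["predicted"]),
        "recall": recall_score(g["actual"], g["predicted"]),
        "auc": roc_auc_score(g["actual"], g["score"]),
    })
)
print(monthly)

# Simple alert when recall falls below a chosen floor.
if (monthly["recall"] < 0.7).any():
    print("ALERT: recall below threshold in at least one month")
```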

7
Q

How do you ensure that your Python code for models or tests is repeatable and auditable?

A

Use version control (Git), document assumptions, avoid hardcoding values, log intermediate steps with the logging module, and store configurations in external files. Structure code into scripts or notebooks that can be rerun end to end.

This practice enhances reproducibility and auditing.
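A minimal sketch of the logging and external-config pattern; config.json and its threshold key are hypothetical:

```python
import json
import logging

# Log to a file so each run leaves an audit trail.
logging.basicConfig(level=logging.INFO, filename="run.log")
log = logging.getLogger(__name__)

# Parameters live outside the code in a config file.
with open("config.json") as f:
    config = json.load(f)

log.info("Run started with config: %s", config)
threshold = config["threshold"]  # no hardcoded magic numbers
log.info("Using threshold %s", threshold)
```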

8
Q

What’s the difference between a shallow copy and a deep copy in Python?

A

A shallow copy creates a new container whose elements still reference the original nested objects, while a deep copy recursively copies the nested objects as well. This matters in data processing because it prevents unintended modification of the original data.

In pandas, .copy() returns a deep copy by default (deep=True), copying the underlying data and indices.
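The difference is easy to demonstrate with the standard library's copy module:

```python
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)
deep = copy.deepcopy(original)

# Mutate a nested object in the original.
original[0].append(99)

print(shallow[0])  # [1, 2, 99] -- the inner list is shared
print(deep[0])     # [1, 2]     -- nested objects were copied
```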

9
Q

How do you use Python to automate a manual reporting task?

A

Connect to data sources using sqlalchemy, clean and analyze data with pandas, and export results to Excel or PDF with openpyxl or matplotlib. Schedule the script with Windows Task Scheduler or cron.

Automating reporting tasks increases consistency and reduces manual errors.
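A sketch of that pipeline; the connection string, table, and column names are placeholders (to_excel requires openpyxl):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical database connection and query.
engine = create_engine("postgresql://user:pass@host/db")
df = pd.read_sql("SELECT * FROM daily_balances", engine)

# Clean and summarize, then export to Excel.
summary = df.groupby("branch")["balance"].sum().reset_index()
summary.to_excel("daily_report.xlsx", index=False)
# Schedule this script with cron or Windows Task Scheduler.
```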

10
Q

What’s your experience using Python in cross-functional teams?

A

I have collaborated with teammates focused on modeling and data, ensured my code was well-documented and modular for reuse, and shared Jupyter notebooks with visual summaries to bridge understanding.

Effective communication and documentation are key in cross-functional teams.
