SAS Data Curation Professional Flashcards
(31 cards)
Data Curation?
The process of preparing data for analytics is referred to as data curation
Data Scientist?
Takes an enormous mass of messy data points (unstructured and structured) and uses formidable skills in math, statistics, and programming to clean, manage, and organize them. Data scientists then apply their analytic powers - industry knowledge, contextual understanding, and skepticism of existing assumptions - to uncover hidden solutions to business challenges.
Data curation lifecycle?
Data curation refers to the process of finding, exploring, structuring, cleansing, updating, and eventually archiving data. This process can be looked at as the Data Curation Life Cycle
SAS client Applications?
On the SAS Platform, users with different roles each use specialized client applications designed to accomplish specific types of tasks. With these client applications, users access application servers and data sources in order to execute processes, through either programming or point-and-click interfaces.
SAS administrator job?
Users with an administrative role use client applications to define the application servers and data source connections. The administrators also define user and group identities, logins, and permissions in the metadata to control access to application servers and data sources.
Components of a computing environment?
include the processors (also referred to as central processing units, or CPUs), memory, storage, and network.
Different data storage methods?
relational database management systems, Hadoop, data lakes, and cloud storage.
RDBMS
Structured data
Predefined Schemas
SQL programming language
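The three RDBMS traits above can be sketched with Python's built-in sqlite3 module as a lightweight stand-in for a full database; the table and column names here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Predefined schema: columns and their types are declared before any rows load,
# and every row must conform to this structure.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
)

# Structured data: each row has the same fields in the same order.
conn.executemany(
    "INSERT INTO customers (id, name, region) VALUES (?, ?, ?)",
    [(1, "Acme", "East"), (2, "Bolt", "West"), (3, "Core", "East")],
)

# SQL is the language used to query the structured data.
east_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE region = ?", ("East",)
).fetchone()[0]

conn.close()
```

The same pattern applies to production databases (Oracle, PostgreSQL, and so on); only the connection and SQL dialect details change.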
Hadoop
Open Source Software
Computer cluster
Distributed Storage
Parallel processing
Data Lakes
Unstructured and structured data
Large variety and volume of data
Cloud storage
Method for storing data off site
Allows for scalability depending on the amount of storage a company needs
Data is often stored across machines in the cloud
Parallel processing
The concept of breaking jobs into tasks that run simultaneously is referred to as parallel processing. Parallel processing, or parallel computing, allows for jobs to execute faster and processing to happen simultaneously on smaller tasks.
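A minimal sketch of the idea, using Python's standard concurrent.futures threads as a stand-in for parallel workers; real grid or distributed setups run the tasks on separate processes or machines, but the split-run-combine shape is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk):
    # One independent task: total a chunk of values.
    return sum(chunk)

def parallel_total(values, n_tasks=4):
    # Break the job into smaller tasks...
    size = max(1, len(values) // n_tasks)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # ...run the tasks concurrently, then combine their results.
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        return sum(pool.map(summarize, chunks))

# Same answer as a single sequential sum, but the work is divided into
# tasks that can execute at the same time.
```

Because each chunk is processed independently, adding more workers (or machines, in a grid) shortens the job without changing the result.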
Grid computing
Grid computing systems link computer resources together in a way that lets someone use one computer to access and leverage the collected power of all the computers in the system. To the individual user, it’s as if the user’s computer has transformed into a supercomputer.
Cloud computing
Cloud computing is a broad term that refers to the immediate access to computing resources hosted over the internet. These resources can include software, data storage, processing power, and more. Amazon Web Services defines cloud computing as follows: “Cloud computing is the on-demand delivery of computer power, database, storage, applications, and other IT resources via the internet with pay-as-you-go pricing.”
IaaS
Providers of Infrastructure as a Service supply the infrastructure, which includes the basic computing resources and storage, and the users then build everything else that they need. When companies rely on IaaS providers, it can be thought of as renting servers, and their users can install operating systems and programs on the servers.
PaaS
With PaaS, a provider offers more of the application stack than IaaS providers, adding operating systems, middleware (such as databases) and other runtimes into the cloud environment.
SaaS
With Software as a Service, cloud providers host software applications. These applications are available to customers via the internet. SAS offers some SaaS products, including SAS Visual Analytics for SAS Cloud, SAS Visual Statistics for SAS Cloud, SAS Visual Data Mining and Machine Learning for SAS Cloud, and more.
SAS metadata server
The SAS Metadata Server controls access to a central repository of metadata that is shared by all SAS applications in the deployment. The metadata repository includes information about the following:
• libraries and tables that are accessed by your SAS applications
• content created and used by SAS applications, including reports and queries
• SAS and third-party servers that participate in the system
• users and groups and associated permissions
When you log on to SAS applications that are part of the SAS Platform, you first authenticate to the SAS Metadata Server.
Workspace Server
- When users of client applications submit SAS code, it is executed by a SAS Workspace Server session.
- The workspace server supports registering tables in metadata and importing data, tasks that you learn about later.
- When you submit SAS code, the metadata server starts a workspace server session that executes the code.
- SAS deployments can have multiple users submitting SAS code from client sessions, and each user is provided his or her own workspace server session.
- In addition, SAS deployments can be implemented with multiple workspace servers.
Exploring the data?
- Visualize and plot the data
- Identify anomalies and inconsistencies (missing values, incorrect data entries, spelling mistakes, casing)
- Calculate descriptive statistics
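The exploration checks above can be sketched in plain Python; the records and field names here are invented for illustration, standing in for a table being profiled.

```python
import statistics

records = [
    {"city": "Boston",  "sales": 120.0},
    {"city": "boston",  "sales": 95.0},   # casing inconsistency
    {"city": "Chicago", "sales": None},   # missing value
    {"city": "Chicgo",  "sales": 88.0},   # spelling mistake
]

# Identify missing values.
missing = [r for r in records if r["sales"] is None]

# Flag casing inconsistencies: distinct raw spellings that collide
# once lowercased indicate the same value entered with different casing.
raw_cities = {r["city"] for r in records}
folded_cities = {c.lower() for c in raw_cities}
has_casing_issue = len(raw_cities) > len(folded_cities)

# Calculate descriptive statistics on the non-missing values.
values = [r["sales"] for r in records if r["sales"] is not None]
summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "min": min(values),
    "max": max(values),
}
```

Spelling mistakes like "Chicgo" generally need fuzzy matching or reference data to catch automatically, which is where tools such as DataFlux Data Management Studio come in.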
Data Governance?
the overall management of the availability, usability, integrity and security of data used in an enterprise
SAS/ACCESS technology
- SAS/ACCESS technology enables users to query and manage data stored in databases and other data sources.
- Users can manage, update, and query data using SQL that is native to the database or using the SAS language.
SAS Data Integration Studio
- SAS Data Integration Studio is a SAS Platform application interface that enables users to manage their data integration processes across an organization.
- Users can create jobs using a drag-and-drop interface. These jobs (which read data from the source, transform the data, and load it into SAS tables) generate SAS code to access, manipulate, integrate, and store data across a wide variety of data formats.
DataFlux Data Management Studio
- DataFlux Data Management Studio is a platform application interface designed for data integration and advanced data quality.
- To perform a wide variety of data quality operations, users leverage an extensive library of data quality rules and algorithms, referred to as the Quality Knowledge Base, as well as third-party reference data packs.
- These operations include standardization, entity resolution, address verification, and more.
- DataFlux Data Management Studio also has built-in tools to profile data and build business rules, enabling data quality stewards to identify and remedy issues in their data.
- Users can design automated processes to assess data for specific data quality issues and generate alerts when such issues arise.