Summary Flashcards

Question

What are application areas for classification?

Answer 1

Willingness to buy among customers who are interested in an offer | Probability that a customer will switch

Answer 2

A prediction function whose target X is a finite set.

Answer 3

If X is a subset of the real numbers, the function is called a regression function.

Answer 4

Determining a classifier with the smallest number of incorrect predictions is called this.

Answer 5

Gini index, entropy
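
The card's question is not shown here, so the following is only a sketch under the assumption that the standard definitions are meant: both measures quantify how mixed the class labels in a set are, which is how they are commonly used to pick splits in a decision tree. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def gini_index(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the class proportions p."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


# Hypothetical label sets: one pure, one evenly mixed.
pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

print(gini_index(pure), entropy(pure))    # both are 0 for a pure set
print(gini_index(mixed), entropy(mixed))  # both are maximal for a 50/50 split
```

Both values are 0 for a pure set and reach their maximum for two classes when the labels are split evenly.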

Answer 6

The hierarchical partitioning of the training data

Answer 7

Pruning the tree; it prevents the decision tree from overfitting

Answer 8

Goal for a model:
- Correct prediction for instances where the target value is unknown
  → Low error rate when applying the model to unknown instances
- Error rate for the training set is not a good estimate for error rate in the application context
  - Target values are known
  - Rote learning of training instances results in a model with an error rate of 0 %
- Error rate determined on a test data set that is not used for learning the model is a better estimate for error rate in the application context
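
The rote-learning point can be illustrated with a minimal sketch. The learner below simply memorizes the training instances, so its training error is 0 %, while its error on held-out data is much higher. The toy data, the fallback to the majority class, and all names are assumptions made for illustration.

```python
def rote_learner(training_set):
    """Memorize every training instance; fall back to the majority class otherwise."""
    memory = dict(training_set)
    labels = [label for _, label in training_set]
    default = max(set(labels), key=labels.count)
    return lambda x: memory.get(x, default)


def error_rate(model, data):
    """Fraction of instances in data that the model predicts incorrectly."""
    wrong = sum(1 for x, label in data if model(x) != label)
    return wrong / len(data)


# Hypothetical toy data: the true rule is "label is 'high' when x >= 5".
train = [(1, "low"), (2, "low"), (3, "low"), (6, "high"), (8, "high")]
test = [(4, "low"), (5, "high"), (7, "high"), (9, "high")]

model = rote_learner(train)
print(error_rate(model, train))  # 0.0: every training instance was memorized
print(error_rate(model, test))   # 0.75: unseen instances fall back to "low"
```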

Answer 9

For numeric values, the number of clusters k must be given.
Evaluation of the partitions with G:
- Sum over all clusters
- Sum over all records
- Distance of a record to the centre of its cluster
Goal: the sum should be as small as possible → a partition should have the smallest possible value of the variance criterion
Select the partition with the smallest value of G
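
A minimal sketch of how G could be computed and used to compare partitions. The card only says "distance"; the squared Euclidean distance and the mean as cluster centre are assumptions here, as are the function names and the toy records.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def variance_criterion(partition):
    """G: sum over clusters, then over records, of the squared distance to the centre."""
    total = 0.0
    for cluster in partition:
        centre = centroid(cluster)
        for point in cluster:
            total += sum((p - c) ** 2 for p, c in zip(point, centre))
    return total


# Two hypothetical partitions of the same four records into k = 2 clusters.
partition_a = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
partition_b = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]

# Select the partition with the smallest value of G.
best = min([partition_a, partition_b], key=variance_criterion)
print(variance_criterion(partition_a), variance_criterion(partition_b))
```

Here partition_a groups nearby records together and therefore gets the smaller G.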

Answer 10

- Volume: data volume in the range of 10s, 100s of terabytes or even petabytes
- Velocity: speed at which the data arrives and has to be processed and analyzed
- Variety: different types of data
  - Structured
  - Semi-structured, like XML data
  - Unstructured: text, voice, pictures, movies

Answer 11

Solution:
- Store the unfiltered, non-aggregated, but (hopefully) clean and unified data from the production systems in the data warehouse
- Also store all semi-structured or unstructured data to be able to extract the required information

Answer 12

- Fast processing of high volumes of data
- Flexible schemas
- Economic storage for terabytes and petabytes of data
- High reliability and availability
... and everything at affordable cost

Answer 13

New way of storing and processing the data:
- Let the system handle most of the issues automatically:
  - Failures
  - Scalability
  - Reduce communications
  - Distribute data and processing power to where the data is
  - Make parallelism part of the operating system
  - Relatively inexpensive hardware ($2-4K)
- Bring processing to the data!

Answer 14

- Data is simply copied to the file store, no transformation is needed
- A serializer/deserializer is applied during read time for extracting the required columns.
- New data can arrive anytime. New columns can be read once the serializer/deserializer is updated to parse it.
→ Fast load
→ Agility, flexibility
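
The read-time approach above can be sketched in a few lines. This is not how any particular product implements it: the in-memory list standing in for the file store, the JSON record format, and the function names are all assumptions made for illustration.

```python
import json

# Hypothetical raw file store: lines are appended as-is, with no transformation.
file_store = []


def load(raw_line):
    """Fast load: copy the raw record into the store without parsing it."""
    file_store.append(raw_line)


def read(columns):
    """Deserialize at read time and extract only the requested columns."""
    for raw_line in file_store:
        record = json.loads(raw_line)
        yield {col: record.get(col) for col in columns}


load('{"customer": "A", "amount": 10}')
load('{"customer": "B", "amount": 25, "channel": "web"}')  # a new column arrives later

old_view = list(read(["customer", "amount"]))
new_view = list(read(["customer", "channel"]))  # only the reader changes, not the stored data
print(old_view)
print(new_view)
```

Records written before the new column existed simply yield None for it when read.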

Answer 15

1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
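
The three phases can be sketched with a word count, a common teaching example that the card itself does not mention. Everything runs in one process here, whereas a real framework would distribute the map and reduce tasks across machines; the function names and input documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: break one input part into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Shuffle: group the interim values by key for final processing."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: boil each group down to a single result per key."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data", "big clusters process data", "data"]
mapped = [map_phase(doc) for doc in documents]  # one map task per document
counts = reduce_phase(shuffle(mapped))
print(counts)
```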

Answer 16

- Unstructured and structured
- Files
- Only inserts and deletes
- HBase, Hive, Pig, Jaql, Big SQL
- Batch processing
- Data loss can happen sometimes
- Simple file compression
- Commodity hardware
- 2-6 years old technology
- Access files only (streaming)
- Small number of companies using it in production, many startups

Answer 17

- Structured data with known schemas
- Records, long fields, objects, XML
- Updates allowed
- SQL & XQuery
- Quick response, random access
- Data loss is not acceptable
- Sophisticated data compression
- Enterprise hardware
- 30+ years old mature technology
- Random access (indexing)
- Large DBA and application development community, widely used

Answer 18

- Schema has to be created before data can be loaded
- Data has to be loaded to transform it into its internal structure.
- New columns have to be added to a table before data with these new columns can be loaded.
→ Fast read
→ Standards, governance
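
The write-time constraints above can be demonstrated with a small sketch. SQLite (via Python's standard `sqlite3` module) is used only as a convenient stand-in for a relational database; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema has to exist before any data can be loaded.
conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
conn.execute("INSERT INTO sales VALUES (?, ?)", ("A", 10))

# Loading a record with an unknown column fails until the table is altered.
try:
    conn.execute(
        "INSERT INTO sales (customer, amount, channel) VALUES (?, ?, ?)",
        ("B", 25, "web"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Add the new column first, then the same load succeeds.
conn.execute("ALTER TABLE sales ADD COLUMN channel TEXT")
conn.execute(
    "INSERT INTO sales (customer, amount, channel) VALUES (?, ?, ?)",
    ("B", 25, "web"),
)

rows = conn.execute(
    "SELECT customer, amount, channel FROM sales ORDER BY customer"
).fetchall()
print(rejected, rows)
```

The first attempt to load the record with the extra column is rejected; only after the table is altered does the load go through.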

(42 cards)