- Practical Predictive Analytics
- Ralph Winters
- 409字
- 2025-04-04 19:02:43
Database tables
Database tables reside within an enterprise data warehouse (EDW) such as MySQL, SQL Server, or Oracle, or they can be housed on local PCs. This data can be accessed using R provided correct authentication credentials are supplied. Enterprise data contains some of the best data that you can get your hands on. It is usually considered an official data source that is used across a company. Additionally, in a capable enterprise data warehouse system, data is already vetted, so that saves you the work of checking for data quality. However, data extracted from a data warehouse can also be complex, and often needs a good data dictionary and metadata to accompany it.
Data warehouse tables are often optimized for speed. However, it is necessary to understand the logical and physical data structure of the warehouse, along with the column index structure. Knowing this will improve your query performance.
In an EDW, data is often organized along subject lines, with separate specialized tables for each purpose. Sometimes this leads to redundancy, but the advantage is that all the data you need may be located in just a few tables.
Reading data directly from database tables circumvents the R memory limitation, to some extent, since some of the processing is done externally to your local PC. In some instances, parallel processing can be used, which will allow you to access very large datasets.
The usual way of accessing data stored in relational databases is through SQL, ODBC, or JDBC connection. Some systems will accept a direct connection. Some specific packages used to connect to databases are listed here:
- The RODBC package provides an interface to SQLServer
- The RMySQL package is used to connect to MySQL databases
- The ROracle, RJDBC, and RODBC packages can be used to connect to Oracle databases
The universal connector for enterprise databases is ODBC connector. Since it is a universal and is single threaded, ODBC tends to be slow. RJDBC provides access via a JDBC interface. JDBC is faster than ODBC but will require more tweaking. Since connecting to a database is a little more involved than just reading a text file, it will take a little more time and work to get through login IDs, passwords, and optimizing queries. But once you are familiar with your environment, it will be worth the time spent to optimize access, and avoid calls from the DBA letting you know that your queries are taking too long.