How do I import a large 6 Gb CSV file?

He currently focuses on the statistical assessment of adverse weather events and natural hazards, and on disaster risk reduction.
His main interests are statistical modelling of environmental phenomena, as well as open-source tools for data science, geoinformation and remote sensing.

Our random data set will feature more than 8 million rows and 8 columns, and comprises several hundred MB of data. We start by loading the data.table library: `library(data.table)`.
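A minimal sketch of how such a random data set could be generated with data.table. The column names and the reduced row count are illustrative, not the article's exact code:

```r
library(data.table)

set.seed(42)
n <- 1e5  # scaled down from the article's 8+ million rows

# Eight numeric columns of random values, mirroring the article's setup.
dt <- data.table(matrix(runif(n * 8), nrow = n))
setnames(dt, paste0("V", 1:8))

# Persist to CSV so the reading examples have something to load.
fwrite(dt, "random-data.csv")
dim(dt)  # 100000 rows, 8 columns
```

With the full 8 million rows, the same code produces a file large enough to make the performance differences discussed below clearly visible.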
I'm not clear on your goal, but if you're trying to read all of these files into a single R data structure, then I see two major performance concerns:

1. File access times: from the moment you request `read.csv`, a number of processes start on your machine just to locate and open the file. I would expect this to be a nearly constant slowdown as you read in millions of files.

2. Growing your single data structure with each new file read. Every time you want to add a few rows to your matrix, you'll likely need to reallocate a similarly sized chunk of memory in order to store the larger matrix. If you're growing your array 15 million times, you'll certainly notice a performance slowdown here, and it will get progressively worse as you read in more files.
Regarding solutions, I'd say you could start with two things:

1. Combine the CSV files outside of R. A simple shell script would likely do the job if you're just looping through files and concatenating them into a single large file. As Joshua and Richie mention below, you may be able to avoid switching languages altogether by using the more efficient `scan` or `readLines` functions.

2. Pre-size your unified data structure. That way you only have to find room in memory for the object once, and the rest of the operations just insert data into the pre-sized matrix.
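The pre-sizing advice can be sketched as follows. The `read_chunk` helper is a stand-in for reading one file; the pattern, not the names, is the point:

```r
n_files <- 200
rows_per_file <- 50

# Stand-in for reading one file's worth of data.
read_chunk <- function(i) matrix(as.numeric(i), nrow = rows_per_file, ncol = 3)

# Slow pattern: rbind() reallocates the whole matrix on every iteration.
slow <- NULL
for (i in 1:n_files) slow <- rbind(slow, read_chunk(i))

# Fast pattern: allocate the full matrix once, then fill it in place.
fast <- matrix(NA_real_, nrow = n_files * rows_per_file, ncol = 3)
for (i in 1:n_files) {
  rows <- ((i - 1) * rows_per_file + 1):(i * rows_per_file)
  fast[rows, ] <- read_chunk(i)
}

identical(slow, fast)  # TRUE: same result, very different allocation cost
```

With 15 million files instead of 200, the reallocation cost of the first pattern dominates the total runtime.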
Jeff Allen

Also note that `scan` is more appropriate than `read.csv` here.

Jeff - Thanks for this detailed answer. I don't think I can combine all the files into one big one, because I need them to be separated for subsequent analysis. Each file represents one execution of my experiment.
@JoshuaUlrich: I tried `scan` instead of `read.csv`.

Marek

Thanks guys. If you are interested in why, read the rest of this section. Streaming a file means reading it line by line, either keeping only the lines you need or processing each line as you read through the file. It turns out that R is really not very efficient at streaming files.
The main reason is the memory allocation process, which struggles with a constantly growing object, such as a data frame that accumulates only the selected lines. In the next code block, we will read parts of our data file once using the `fread` function, and once line by line.

SQLite databases are single-file databases, meaning you can simply download them, store them in a folder or share them with colleagues, similar to a csv file. We have downloaded a second file, processed-logs-big-file-example.
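The comparison can be sketched like this, on a small stand-in file. The growing `kept` vector in the streaming loop is exactly the allocation problem described above:

```r
library(data.table)

# Small stand-in file (the article's file is much larger).
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(x = 1:1000, y = rnorm(1000)), tmp)

# Reading everything at once with fread: a single fast allocation.
dt <- fread(tmp)

# Streaming line by line: the result object grows on every iteration,
# forcing repeated reallocation, which is why this pattern is slow in R.
con <- file(tmp, "r")
invisible(readLines(con, n = 1))          # skip the header line
kept <- character(0)
while (length(line <- readLines(con, n = 1)) > 0) {
  kept <- c(kept, line)
}
close(con)

length(kept) == nrow(dt)  # TRUE: both approaches saw 1000 data rows
```

On the real multi-million-row file, the `fread` call finishes in seconds while the streaming loop takes orders of magnitude longer.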
Furthermore, the database file contains indexes, which dramatically reduce the time needed to perform search queries. If you do not have a SQLite database containing your data, you can first convert your csv file into a SQLite database as described further in this tutorial.
This provides a convenient and fast way to request subsets of data from our large data file. We could do the same analysis for each of the serial numbers, each time only loading that subset of the data.
By using a for loop, the calculation is done for each of the birds separately, and the amount of data loaded into memory at any one time stays small:
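A sketch of that loop, using an in-memory stand-in for the tracking database. The table and column names (`tracking`, `device_info_serial`, `speed`) are assumptions based on the bird-tracking context, not the tutorial's exact schema:

```r
library(DBI)
library(RSQLite)

# In-memory stand-in for the tracking database.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "tracking",
             data.frame(device_info_serial = rep(c(101L, 102L), each = 10),
                        speed = runif(20)))

serials <- dbGetQuery(con, "SELECT DISTINCT device_info_serial FROM tracking")[[1]]

mean_speed <- numeric(0)
for (serial in serials) {
  # sprintf() inserts the current serial id into the query string.
  query <- sprintf("SELECT speed FROM tracking WHERE device_info_serial = %d",
                   serial)
  bird  <- dbGetQuery(con, query)   # only this bird's rows are in memory
  mean_speed[as.character(serial)] <- mean(bird$speed)
}
dbDisconnect(con)
mean_speed  # one summary value per serial number
```

Each pass through the loop pulls only one bird's subset out of SQLite, so peak memory usage is bounded by the largest single subset rather than the whole data set.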
Note that we use the `sprintf` function to dynamically insert the serial id into the SQLite query we execute; read the manual of `sprintf` for more information and options. Alternatively, dplyr will translate your commands to SQL for you, still allowing you to take advantage of the indexes in the SQLite database.
dplyr provides the ability to perform queries like the ones above without needing to know SQL. If you want to learn more about how to use dplyr with a SQLite database, head over to this vignette. If you have a CSV file and would like to query the data using SQL or dplyr as shown in the previous sections, you can convert the data to a SQLite database.
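A sketch of the dplyr route, again with an in-memory stand-in database and assumed table/column names (`tracking`, `device_info_serial`, `speed`). This relies on the dbplyr backend being installed:

```r
library(DBI)
library(RSQLite)
library(dplyr)

# In-memory stand-in database; names are assumptions, not the tutorial's.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "tracking",
             data.frame(device_info_serial = rep(c(101L, 102L), each = 10),
                        speed = runif(20)))

# tbl() creates a lazy reference; dplyr translates the verbs below to SQL
# and SQLite does the work. Data only reaches R when collect() is called.
result <- tbl(con, "tracking") %>%
  group_by(device_info_serial) %>%
  summarise(mean_speed = mean(speed, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
result  # one row per serial number
```

You can inspect the generated SQL with `show_query()` before calling `collect()`, which is a good way to confirm that the query can use the database's indexes.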
The conversion will require some time, but once the database file is available, you can query the data with SQL or with dplyr as just shown.
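A minimal sketch of the conversion, assuming hypothetical file paths and the same assumed schema as above:

```r
library(DBI)
library(RSQLite)

# Hypothetical paths; replace with your own file names.
csv_file <- tempfile(fileext = ".csv")
db_file  <- tempfile(fileext = ".sqlite")
write.csv(data.frame(device_info_serial = rep(1:3, each = 5),
                     speed = runif(15)),
          csv_file, row.names = FALSE)

con <- dbConnect(SQLite(), db_file)

# For a file that fits in memory, a single dbWriteTable() call suffices;
# for a truly large file, read and append the CSV in chunks instead.
dbWriteTable(con, "tracking", read.csv(csv_file))

# Adding an index on the query column is what makes later lookups fast.
dbExecute(con, "CREATE INDEX idx_serial ON tracking (device_info_serial)")

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM tracking")$n
dbDisconnect(con)
n_rows  # 15
```

The one-off index creation is the step that pays for itself: every subsequent `WHERE device_info_serial = ...` query can use it instead of scanning the whole table.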