@[email protected] to

[email protected] • English • 3 months ago

parquet vs csv

What’s your take on parquet?

I’m still reading into it. Why is it closely related to Apache? Does only Apache push it? Meaning, if Apache drops it, would there be no interest from others to push it further?

It’s published under the Apache License 2.0, which is a permissive license. Is there a drawback to the license?

Do you use it? When?

I assume for sharing small data, csv is sufficient. Also, I assume csv is more accessible than parquet.

@[email protected]
23•3 months ago
I would not recommend using parquet instead of csv. Indeed, parquet is a type of wooden flooring, while csv is a human-readable file format. As you can see, it is not wise to replace one with the other. Don’t hesitate to ask more questions about your home design!
Eager Eagle
English
14•3 months ago
It’s pretty much an industry standard at this point, I wouldn’t be worried about its future. I’ve used it and still do once in a while.
- @Hawk
  4•3 months ago
  Couldn’t agree more.
  
  Important benefits include:
  
  types (e.g. dates)
  
  easy import into DuckDB
  
  can be viewed with VisiData
@[email protected]
11•3 months ago
Parquet is closely tied to the Apache Software Foundation because it was designed as a storage format for Hadoop.

But many data processing libraries offer interfaces for handling Parquet files, so you can use it outside the Hadoop ecosystem.

It’s really good for archiving data, because the format can store a lot of data in relatively little disk space while still providing OK read performance: because of how the files are structured, you often won’t need to read the whole file. CSV files, by contrast, are plain text and take up more disk space.
Jim
English
11•3 months ago
Do you use it? When?

Parquet is mainly used for big-data batch processing. It’s a columnar file format optimized for large aggregation queries. It isn’t human-readable, so you need a library like Apache Arrow to read and write it.

I would use parquet in the following circumstances (or combination of circumstances):

The data is very large

I’m integrating this into an analytical query engine (Presto, etc.)

I’m transporting data that needs to land in an analytical data warehouse (Snowflake, BigQuery, etc.)

The data is consumed by data scientists, machine learning engineers, or other data engineers

Since the data is stored column by column, a query like select sum(sales) from revenue is much cheaper and faster when the underlying data is in Parquet rather than CSV.

The big advantage of csv is that it’s more portable. csv as a data file format has been around forever, so it is used in a lot of places where parquet can’t be used.
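To make the sum(sales) point concrete, here is a rough sketch of why a columnar layout helps an aggregation. It is a toy model in plain Python, not the real Parquet layout; the region and notes columns and all the values are made up for illustration.

```python
# Toy model of row-oriented vs column-oriented storage (not real Parquet).
# All column names other than "sales" and all values are hypothetical.

# Row-oriented, like CSV: every record carries all of its fields.
rows = [
    {"region": "north", "sales": 120, "notes": "restock"},
    {"region": "south", "sales": 80, "notes": "promo"},
    {"region": "east", "sales": 200, "notes": "launch"},
]

# Column-oriented, like Parquet: each column is stored contiguously.
columns = {
    "region": ["north", "south", "east"],
    "sales": [120, 80, 200],
    "notes": ["restock", "promo", "launch"],
}


def sum_sales_row_layout(records):
    """Walk every record, counting every field we had to pass over."""
    fields_touched = 0
    total = 0
    for record in records:
        fields_touched += len(record)
        total += record["sales"]
    return total, fields_touched


def sum_sales_column_layout(table):
    """Read only the 'sales' column; the other columns are never touched."""
    sales = table["sales"]
    return sum(sales), len(sales)


row_total, row_fields = sum_sales_row_layout(rows)
col_total, col_fields = sum_sales_column_layout(columns)
print(row_total, row_fields)  # 400 9
print(col_total, col_fields)  # 400 3
```

Both layouts give the same answer, but the columnar one only touches a third of the values here. With wide tables and many rows, that gap is where the savings come from.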
@[email protected]
8•3 months ago
Yeah depends on what you’re using it for. CSV is terrible in many many ways but it is widely supported and much less complex.

I would guess if you’re considering Parquet then your use case is probably one where you should use it.

JSON is another option, but I would only use it if you can guarantee that you’ll never have more than like 100MB of data. Large JSON files are extremely painful.
- Eager Eagle
  English
  1•3 months ago
  Since the data is tabular, JSONL (JSON Lines) works better than JSON for large files.
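A small sketch of the JSONL point, using only the standard library. The sample records are hypothetical. With one JSON object per line you can process a file record by record, while json.load has to parse the whole document before you get anything back.

```python
import io
import json

# Hypothetical sample data in JSON Lines form: one JSON object per line.
jsonl_text = (
    '{"id": 1, "sales": 120}\n'
    '{"id": 2, "sales": 80}\n'
    '{"id": 3, "sales": 200}\n'
)

# The same records as a single JSON array.
json_text = '[{"id": 1, "sales": 120}, {"id": 2, "sales": 80}, {"id": 3, "sales": 200}]'


def total_sales_jsonl(stream):
    """Parse one line at a time, so only one record is in memory at once."""
    total = 0
    for line in stream:
        if line.strip():  # tolerate blank lines
            total += json.loads(line)["sales"]
    return total


def total_sales_json(stream):
    """json.load builds the entire list of records before returning."""
    records = json.load(stream)
    return sum(record["sales"] for record in records)


jsonl_total = total_sales_jsonl(io.StringIO(jsonl_text))
json_total = total_sales_json(io.StringIO(json_text))
print(jsonl_total, json_total)  # 400 400
```

In practice the streams would be open file handles rather than io.StringIO objects; the in-memory strings just keep the sketch self-contained.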
The Hobbyist
5•3 months ago
In the deep learning community, I know of someone using parquet for the dataset and annotations. It allows you to select which data you want to retrieve from the dataset and stream only those, and nothing else. It is a rather effective method for that if you have many different annotations for different use cases and want to be able to select only the ones you need for your application.
- @[email protected]
  English
  2•3 months ago
  How does this differ from GraphQL?
  - @[email protected]
    5•3 months ago
    GraphQL is a protocol for interacting with a remote system; Parquet is about having a local file that you can index and retrieve data from more efficiently. It’s especially useful when the data has a fairly well-defined structure but may be large enough that you can’t or don’t want to bring it all into memory. They’re similar concepts, but with different applications.
    - @[email protected]
      English
      3•3 months ago
      Thank you!
  - @[email protected]
    4•3 months ago
    Parquet is a storage format; GraphQL is a query language/transmission strategy.
tiredofsametab
4•3 months ago
I can’t address the first part, but for your last paragraph: if you’re sharing with humans, CSV is fine. If you’re sharing with humans and machines, JSON or YAML or something similar is probably fine. If you’re only moving things around to give to machines, what to use depends on your constraints and use cases.
@[email protected]
English
3•3 months ago
I’m a data engineer, use parquet all the time and absolutely love love love it as a format!

Arrow (an in-memory data format) + Parquet is particularly powerful, and lets you:

Only read the columns you need (with a CSV your computer has to parse all the data even if afterwards you discard all but one column)

Use metadata to only read relevant files. This is particularly cool and probably needs some unpacking. Say you’re reading 10 files, but only want data where “column-a” is greater than 5. A Parquet reader can look at the metadata in each file’s footer at run time (it includes min/max statistics per column), figure out that a file doesn’t have any column-a values over five, and therefore never have to read it!

Have data in an unambiguous format that can be read by multiple programming languages. Since CSV is text, anything reading it will look at a value like “2022-04-05” and say “oh, this text looks like dates, let’s see what happens if I read it as dates”. Parquet contains actual data type information, so it will always be read consistently.
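Here is a minimal sketch of the type problem with CSV, using only Python's standard library. The column names and the "04/05/2022" value are hypothetical, chosen to show how a reader's guess can go two ways.

```python
import csv
import io
from datetime import datetime

# Hypothetical CSV content; the file itself carries no type information.
csv_text = "order_date,sales\n04/05/2022,120\n"

row = next(csv.DictReader(io.StringIO(csv_text)))

# Every value comes back as a string, including the number.
print(row)  # {'order_date': '04/05/2022', 'sales': '120'}

# The reader has to guess a date format, and two plausible guesses disagree.
as_month_first = datetime.strptime(row["order_date"], "%m/%d/%Y").date()
as_day_first = datetime.strptime(row["order_date"], "%d/%m/%Y").date()
print(as_month_first)  # 2022-04-05
print(as_day_first)    # 2022-05-04
```

A typed format stores the column as a date in the first place, so there is nothing left to guess.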

If you’re handling a lot of data, this kind of stuff can wind up making a huge difference.
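The file-skipping trick can be sketched in a few lines. This is a toy model in plain Python, not how any real Parquet library is implemented: ToyFile and read_greater_than are made-up names, and each "file" just keeps min/max values next to its data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    """Hypothetical stand-in for a data file: values plus min/max metadata."""
    name: str
    values: list
    min_a: int = field(init=False)
    max_a: int = field(init=False)

    def __post_init__(self):
        # Statistics are computed once, when the "file" is written.
        self.min_a = min(self.values)
        self.max_a = max(self.values)


def read_greater_than(files, threshold):
    """Return matching values and the names of files that were scanned."""
    matches, scanned = [], []
    for f in files:
        if f.max_a <= threshold:
            continue  # metadata proves nothing can match, so skip the data
        scanned.append(f.name)
        matches.extend(v for v in f.values if v > threshold)
    return matches, scanned


files = [
    ToyFile("part-0", [1, 2, 3]),
    ToyFile("part-1", [4, 9, 6]),
    ToyFile("part-2", [0, 5, 5]),
]

matches, scanned = read_greater_than(files, 5)
print(matches)  # [9, 6]
print(scanned)  # ['part-1']
```

Two of the three files are ruled out from their metadata alone. The same idea scales to thousands of files, which is where it starts to matter.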
@[email protected]
2•3 months ago
Parquet 4 eva

CSV is for arcane software or if you don’t know where it’s going.

HDF5 is for MATLAB interoperability.

Otherwise I use Parquet (ORC could also work, but I never actually use it). Sometimes Parquet has problems with pandas or Polars, but I’ve always been able to fix them by using pyarrow.
@[email protected]
2•3 months ago
Friends don’t let friends use csv in 2024. Excel needs a good parquet importer and exporter today. Ya hearing Microsoft? Quit pissing around with recall and build something useful!
- @[email protected]
  2•3 months ago
  Isn’t the question here why friends shouldn’t let friends use CSV?
  - @[email protected]
    1•3 months ago
    Excel, mostly. CSV was never much of a standard, and thus it’s horrible to work with. We can fix that with a Parquet importer and exporter!
