@[email protected] to [email protected] • English • 6 months ago

parquet vs csv

What’s your take on parquet?

I’m still reading into it. Why is it so closely tied to Apache? Does only Apache push it? Meaning, if Apache dropped it, would no one else have an interest in pushing it further?

It’s published under the Apache License 2.0, which is a permissive license. Is there a drawback to the license?

Do you use it? When?

I assume that for sharing small data, CSV is sufficient. I also assume CSV is more accessible than Parquet.

@[email protected] • English • 6 months ago
I’m a data engineer, use parquet all the time and absolutely love love love it as a format!

Arrow (an in-memory columnar data format) + Parquet is particularly powerful, and lets you:

Only read the columns you need (with a CSV, your computer has to parse all the data even if you afterwards discard all but one column).

Use metadata to only read relevant files. This is particularly cool and probably needs some unpacking. Say you’re reading 10 files, but only want data where “column-a” is greater than 5. A Parquet reader can look at the metadata stored in each file’s footer (per-column statistics such as min and max) at run time, and figure out that a file doesn’t have any column-a values over five. It then never has to read that file’s data!

Have data in an unambiguous format that can be read by multiple programming languages. Since CSV is text, anything reading it will look at a value like “2022-04-05” and say “oh, this text looks like a date, let’s see what happens if I read it as a date”. Parquet contains actual data type information, so it will always be read consistently.
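To make the type point concrete, here is a minimal sketch using only Python's standard `csv` module. The row contents and column names (`day`, `count`, `zip_code`) are made up for illustration. It shows that after a round trip through CSV every value comes back as plain text, so the reader has to guess the types again.

```python
import csv
import io
from datetime import date

# Hypothetical sample row: a date, an integer, and a string that only looks numeric.
rows = [{"day": date(2022, 4, 5), "count": 7, "zip_code": "02134"}]

# Write the row to an in-memory CSV "file".
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["day", "count", "zip_code"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the csv module returns every field as str.
buf.seek(0)
read_back = list(csv.DictReader(buf))

# Map each column to the type name it has after the round trip.
types_after = {key: type(value).__name__ for key, value in read_back[0].items()}
print(read_back[0])
print(types_after)
```

Any tool that loads this CSV has to decide on its own whether `zip_code` is a number (and drop the leading zero) or text. A format that stores the schema with the data avoids that guess.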

If you’re handling a lot of data, this kind of stuff can wind up making a huge difference.
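The metadata-based skipping described above can be modeled with a toy sketch in plain Python. This is not a Parquet library API: `FileStats`, `files_to_read`, and the file names are hypothetical stand-ins for the min/max statistics a real reader would get from each file's metadata.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for per-column min/max statistics of one file."""
    path: str
    col_min: int
    col_max: int


def files_to_read(stats, threshold):
    """Keep only files that could contain a value greater than threshold."""
    # If a file's maximum is not above the threshold, no row in it can match,
    # so the file can be skipped without reading its data.
    return [s.path for s in stats if s.col_max > threshold]


# Made-up statistics for "column-a" in three files.
stats = [
    FileStats("part-0.parquet", col_min=0, col_max=3),
    FileStats("part-1.parquet", col_min=2, col_max=9),
    FileStats("part-2.parquet", col_min=6, col_max=12),
]

selected = files_to_read(stats, threshold=5)
print(selected)
```

With these made-up numbers, `part-0.parquet` is skipped because its largest value is 3. Only the small metadata record is consulted for that file, which is where the savings come from on large datasets.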
