Commit 3d018ae

Made some changes to SparkSQLTableDemo.py
1. Use partition
2. Use bucket with sortBy
1 parent 049167c commit 3d018ae

1 file changed: +17 -0 lines changed

06-SparkSQLTableDemo/SparkSQLTableDemo.py

Lines changed: 17 additions & 0 deletions
@@ -19,8 +19,25 @@
 spark.sql("CREATE DATABASE IF NOT EXISTS AIRLINE_DB")
 spark.catalog.setCurrentDatabase("AIRLINE_DB")
 
+# flightTimeParquetDF.write \
+# .mode("overwrite") \
+# .saveAsTable("flight_data_tbl")
+
+# Partition by ORIGIN, OP_CARRIER
+# flightTimeParquetDF.write \
+# .mode("overwrite") \
+# .partitionBy("ORIGIN", "OP_CARRIER") \
+# .saveAsTable("flight_data_tbl")
+
+# The partitioned write above creates too many partitions.
+# Use buckets instead: with only 5 buckets, each row's bucket is a hash of the bucket columns modulo 5.
+# Every row with the same ORIGIN and OP_CARRIER combination falls into the same bucket,
+# so we also sort the rows within each bucket.
 flightTimeParquetDF.write \
+    .format("csv") \
     .mode("overwrite") \
+    .bucketBy(5, "ORIGIN", "OP_CARRIER") \
+    .sortBy("OP_CARRIER", "ORIGIN") \
     .saveAsTable("flight_data_tbl")
 
 logger.info(spark.catalog.listTables("AIRLINE_DB"))
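
The comments in this change describe bucket assignment as "hash and modulus". As a rough illustration of that idea, and of why a fixed bucket count avoids the partition explosion, here is a minimal pure-Python sketch. It is not Spark's implementation: `zlib.crc32` is a stand-in hash chosen only because it is deterministic (Spark uses its own hash function internally), and the sample rows and the `bucket_for` helper are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 5  # mirrors bucketBy(5, ...) in the commit


def bucket_for(origin: str, carrier: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a key pair to a bucket as hash(key) modulo num_buckets.

    zlib.crc32 is a stand-in hash for illustration, not Spark's own hash.
    """
    key = f"{origin}|{carrier}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


# Hypothetical sample rows: (ORIGIN, OP_CARRIER)
rows = [
    ("JFK", "AA"), ("JFK", "DL"), ("LAX", "AA"), ("LAX", "UA"),
    ("ORD", "UA"), ("ORD", "AA"), ("SFO", "DL"), ("JFK", "AA"),
]

# partitionBy: one directory per distinct (ORIGIN, OP_CARRIER) combination,
# so the directory count grows with the number of distinct values.
partition_dirs = {f"ORIGIN={o}/OP_CARRIER={c}" for o, c in rows}

# bucketBy: the number of buckets is fixed, however many distinct keys exist.
buckets = defaultdict(list)
for origin, carrier in rows:
    buckets[bucket_for(origin, carrier)].append((origin, carrier))

# sortBy("OP_CARRIER", "ORIGIN"): order the rows inside each bucket.
sorted_buckets = {
    bucket_id: sorted(members, key=lambda row: (row[1], row[0]))
    for bucket_id, members in buckets.items()
}

print("distinct partition directories:", len(partition_dirs))
print("buckets in use (at most 5):", len(sorted_buckets))
```

With real flight data the number of distinct ORIGIN and OP_CARRIER combinations is far larger than in this toy sample, while the bucket count stays capped at the value passed to `bucketBy`.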
